The Ames Housing dataset is a detailed record of residential property sales in Ames, Iowa, covering transactions between 2006 and 2010. With 1,460 entries and 81 columns, it describes each property's zoning classification, lot dimensions, street access, and utilities, along with the quality and condition of the house, the year it was built, its amenities, and its physical and functional attributes. Because it spans structural characteristics through to neighbourhood features, the dataset supports detailed real estate market studies, including statistical work on price prediction and housing market trends. It is a useful resource for real estate developers, economic forecasters, and academics working in urban planning and property valuation.
The Ames Housing data is a popular alternative to the Boston Housing data, which is commonly used when teaching regression analysis. It is a rich tabular dataset describing the real estate market in Ames, Iowa, from 2006 to 2010: 1,460 observations with 79 explanatory variables covering many facets of residential properties (together with the Id column and the SalePrice target, these make up the 81 columns). The variables range from architectural specifications and materials to the condition of individual components of a house and its surroundings, and the goal is to predict the selling price of each home.
To effectively analyze the Ames Housing dataset and derive meaningful insights, expertise in several key domains is crucial:
Real Estate Market Trends: Understanding both general and local real estate market trends in Ames, Iowa, including average pricing, popular buying areas, buyer preferences, and seasonal buying patterns.
Construction and Housing Features: Familiarity with various construction elements such as foundation types, roofing materials, and exterior features, and how these affect property durability and pricing.
Seasonal Impact on Sales: Awareness of seasonal variations in real estate transactions in the U.S., including the busiest seasons (spring and summer) with increased demand and peak prices, as well as the relatively slower seasons (fall and winter) with softer prices and longer time on the market.
Zoning and Regulatory Compliance: Knowledge of local land-use regulations and zoning laws that can influence real estate development projects and property values.
Economic Indicators: Understanding of local economic conditions affecting the housing market, including employment rates, average incomes, and measures of economic growth.
The determination of real estate value in the Ames Housing dataset relies on several key variables:
Physical Features: Variables like lot area, overall quality, overall condition, and year built directly influence property valuation based on the quality of materials, finish, and age of construction.
Location: The variable Neighborhood categorizes houses into various parts of Ames, impacting price due to location desirability and local amenities.
Size and Space: Features such as above-ground living area square footage (GrLivArea) and total basement square footage (TotalBsmtSF) are crucial indicators of property size and space.
Amenities: Factors like the presence of fireplaces, garage size (GarageCars), and whether the property has a pool (PoolQC) contribute significantly to property value by enhancing amenities and lifestyle.
Renovations and Upgrades: The remodel date (YearRemodAdd) is important, indicating recent changes or improvements that could materially impact the sale price, highlighting the significance of renovations and upgrades in determining property value.
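The remodel date becomes easier to reason about once it is turned into derived features. The following is a minimal sketch, assuming that YearRemodAdd equals YearBuilt when no remodel took place; the `toy` data frame and the `Remodeled` and `YearsSinceRemod` column names are hypothetical and used only for illustration.

```r
# Toy stand-in for the training data; values copied from the first four rows shown by head()
toy <- data.frame(
  YearBuilt    = c(2003, 1976, 2001, 1915),
  YearRemodAdd = c(2003, 1976, 2002, 1970),
  YrSold       = c(2008, 2007, 2008, 2006)
)

# Assumption: a remodel date equal to the build year means "never remodeled"
toy$Remodeled <- toy$YearRemodAdd != toy$YearBuilt

# Years elapsed between the last remodel (or construction) and the sale
toy$YearsSinceRemod <- toy$YrSold - toy$YearRemodAdd

print(toy)
```

Features of this kind could then be compared against SalePrice in the same way as the raw columns.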
Based on the above domain knowledge and dataset understanding, several analytical questions can be formulated:
How do external features such as proximity and lot area influence the sale price of homes in Ames?
What effects do renovations have on the sale price of a house?
How do energy efficiency and utilities impact the sale price of a house?
What is the impact of landscape and outdoor features on the sale price of a house?
How do neighborhood amenities affect the sale price of a house?
How do market dynamics influence the sale price of a house?
How do seasonal trends affect the sale price of houses in Ames?
How do the quality and condition of a house impact the sale price of houses in Ames?
What is the relationship between having a garage and the sale price of houses in Ames?
Addressing these questions through detailed data analysis will allow for effective modeling and prediction of housing prices, providing valuable insights for potential buyers, sellers, and real estate professionals in Ames, Iowa.
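As an illustration of how one of these questions might be approached, the seasonal question can be framed as a grouped summary of SalePrice by MoSold. This is only a sketch on a small hypothetical `toy` data frame (the prices are illustrative, not results from the dataset), using base R's aggregate and table.

```r
# Toy sales records: month sold (1-12) and sale price
toy <- data.frame(
  MoSold    = c(2, 5, 9, 2, 12, 10, 5, 5),
  SalePrice = c(208500, 181500, 223500, 140000, 250000, 143000, 200000, 160000)
)

# Median sale price for each month that appears in the data
by_month <- aggregate(SalePrice ~ MoSold, data = toy, FUN = median)

# Number of sales per month, a rough proxy for seasonal demand
sales_per_month <- table(toy$MoSold)

print(by_month)
print(sales_per_month)
```

The median is used rather than the mean so that a single expensive sale in a quiet month does not dominate the comparison.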
The preprocessing of data for the Ames Housing dataset is crucial for accurate analysis. This involves cleaning the data and addressing missing values to ensure the integrity of predictive modeling outcomes. Anomalies in the data could significantly impact the results, making thorough preprocessing essential for reliable analysis.
The datasets “train.csv” and “test.csv” are imported into R using the
read.csv function. “train.csv” is loaded into the variable
“ameshous_train_data” and “test.csv” into “ameshous_test_data”. The
training data is then inspected with summary, str, head, dim, and
colnames before further processing.
ameshous_train_data <- read.csv("datasets/train.csv")
ameshous_test_data <- read.csv("datasets/test.csv")
summary(ameshous_train_data)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
str(ameshous_train_data)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
head(ameshous_train_data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
dim(ameshous_train_data)
## [1] 1460 81
colnames(ameshous_train_data)
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
View(ameshous_train_data)
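The str output above lists a mix of integer and character columns. A quick tally of column classes can summarize that mix in one line, which helps when deciding how each column should be cleaned. The sketch below runs on a hypothetical three-column `toy` data frame rather than the full dataset.

```r
# Toy stand-in for ameshous_train_data (values are illustrative only)
toy <- data.frame(
  LotArea   = c(8450L, 9600L, 11250L),
  MSZoning  = c("RL", "RL", "RM"),
  SalePrice = c(208500L, 181500L, 223500L),
  stringsAsFactors = FALSE
)

# First class of each column, then a count per class
col_types   <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_types)
print(type_counts)

# Names of the character (categorical) columns
char_cols <- names(col_types)[col_types == "character"]
print(char_cols)
```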
The code uses dplyr and
tidyr to summarize missing values in the
“ameshous_train_data” dataset. It counts the NAs in each column, then
reshapes the result into a long format with one row per variable and its
missing-value count, keeping only variables with at least one NA. The
output shows that PoolQC, MiscFeature, Alley, and Fence have the most
missing values (all stored as NA).
library(dplyr)  # summarise(), across(), filter(), arrange()
library(tidyr)  # pivot_longer()
# Count NAs per column, reshape to long format, keep only columns with missing values
missing_values <- ameshous_train_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "MissingCount") %>%
  filter(MissingCount > 0) %>%
  arrange(desc(MissingCount))
missing_values <- as.data.frame(missing_values)
# Copy used by the bar chart of missing counts
missing_data_df <- data.frame(Variable = missing_values$Variable, MissingCount = missing_values$MissingCount)
print(missing_values)
## Variable MissingCount
## 1 PoolQC 1453
## 2 MiscFeature 1406
## 3 Alley 1369
## 4 Fence 1179
## 5 FireplaceQu 690
## 6 LotFrontage 259
## 7 GarageType 81
## 8 GarageYrBlt 81
## 9 GarageFinish 81
## 10 GarageQual 81
## 11 GarageCond 81
## 12 BsmtExposure 38
## 13 BsmtFinType2 38
## 14 BsmtQual 37
## 15 BsmtCond 37
## 16 BsmtFinType1 37
## 17 MasVnrType 8
## 18 MasVnrArea 8
## 19 Electrical 1
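Raw counts are easier to judge as a share of the rows: PoolQC, for example, is missing in 1453 of 1460 rows, or about 99.5%. The sketch below shows one way to compute such percentages with base R, using a hypothetical four-row `toy` data frame in place of the full dataset.

```r
# Toy data with a mostly-missing column, a partly-missing column, and a complete one
toy <- data.frame(
  PoolQC      = c(NA, NA, NA, "Gd"),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; column means are the missing fractions
missing_pct <- colMeans(is.na(toy)) * 100

# Keep only columns with missing values, largest share first
missing_pct <- sort(missing_pct[missing_pct > 0], decreasing = TRUE)
print(round(missing_pct, 1))
```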
This R snippet uses ggplot2 to draw a bar chart of the missing-value count for each variable in the “missing_data_df” data frame. Variables are ordered along the x-axis by descending missing count, and each bar is drawn in blue with a white border and labeled with its count using geom_text.
library(ggplot2)
# fill is mapped to MissingCount so that scale_fill_gradient() actually takes effect
ggplot(missing_data_df, aes(x = reorder(Variable, -MissingCount), y = MissingCount, fill = MissingCount)) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = MissingCount), vjust = -0.3, color = "black", size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "blue", name = "Missing Count") +
labs(
title = "Missing Data Counts by Variable",
subtitle = "Total counts of missing entries for each variable in the dataset",
x = "Variable",
y = "Number of Missing Values"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12),
axis.title = element_text(size = 14),
axis.text.x = element_text(angle = 45, hjust = 1, size = 12, color = "gray50"),
axis.text.y = element_text(size = 12, color = "gray50"),
legend.position = "none",
plot.margin = unit(c(10, 10, 10, 10), "pt")
)
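The R code below fills in the missing values in the dataset and saves the cleaned data as “amesclean_train_data.csv”. For the categorical features listed in features_none, NA in most cases means the house lacks that feature, so it is recoded as "None". The numeric LotFrontage column is imputed with its median instead (an assumed choice), because writing "None" into it would convert the whole column to character.
features_none <- c("Alley", "MasVnrType", "BsmtQual", "BsmtCond", "BsmtExposure",
                   "BsmtFinType1", "BsmtFinType2", "FireplaceQu", "GarageType",
                   "GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature")
# Categorical features: recode NA as the explicit level "None"
for (feature in features_none) {
  ameshous_train_data[[feature]][is.na(ameshous_train_data[[feature]])] <- "None"
}
# LotFrontage is numeric; median imputation is an assumed choice that keeps it numeric
ameshous_train_data$LotFrontage[is.na(ameshous_train_data$LotFrontage)] <- median(ameshous_train_data$LotFrontage, na.rm = TRUE)
# No masonry veneer recorded: set the area to 0
ameshous_train_data$MasVnrArea[is.na(ameshous_train_data$MasVnrArea)] <- 0
# Missing garage year: fall back to the year the house was built
ameshous_train_data$GarageYrBlt[is.na(ameshous_train_data$GarageYrBlt)] <- ameshous_train_data$YearBuilt[is.na(ameshous_train_data$GarageYrBlt)]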
mode_electrical <- names(which.max(table(ameshous_train_data$Electrical)))
ameshous_train_data$Electrical[is.na(ameshous_train_data$Electrical)] <- mode_electrical
missing_values_summary <- sapply(ameshous_train_data, function(x) sum(is.na(x)))
missing_columns <- names(missing_values_summary)[missing_values_summary > 0]
missing_values_df <- ameshous_train_data[, missing_columns]
print(missing_values_summary)
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 0 0
## Street Alley LotShape LandContour Utilities
## 0 0 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 0 0 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 0 0 0 0 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 0 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 0 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 0 0 0
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 0 0 0 0 0
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 0 0 0
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
write.csv(ameshous_train_data, "datasets/amesclean_train_data.csv", row.names = FALSE)
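A check worth running after any imputation step is that each column kept its type, since assigning a string into a numeric column silently converts the whole column to character. The sketch below shows one possible type-aware approach on a hypothetical `toy` data frame. The choice of the median for numeric columns is an assumption made for illustration, not a rule taken from the dataset's documentation.

```r
# Toy data: one categorical and one numeric column, both with gaps
toy <- data.frame(
  Alley       = c(NA, "Grvl", NA, "Pave"),
  LotFrontage = c(65, NA, 68, 60),
  stringsAsFactors = FALSE
)

for (col in names(toy)) {
  na_rows <- is.na(toy[[col]])
  if (is.character(toy[[col]])) {
    # Categorical: NA becomes the explicit level "None"
    toy[[col]][na_rows] <- "None"
  } else if (is.numeric(toy[[col]])) {
    # Numeric: NA becomes the column median (assumed strategy)
    toy[[col]][na_rows] <- median(toy[[col]], na.rm = TRUE)
  }
}

# Column types should be unchanged after imputation
str(toy)
```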
This R script visualizes the correlation matrix of the numeric features in the cleaned dataset (read back into “ames_housing”) using ggplot2 and reshape2. It extracts the numeric columns, computes the pairwise correlation matrix, reshapes it to long format with melt, and draws a tile plot in which color encodes the strength and direction of each correlation. Text labels show the numeric correlation values, and minimal styling keeps the plot readable.
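library(reshape2)  # melt()
# Reload the cleaned data and keep only the numeric columns
ames_housing <- read.csv("datasets/amesclean_train_data.csv")
ames_numeric <- ames_housing[sapply(ames_housing, is.numeric)]
# Pairwise correlations, reshaped to long format (Var1, Var2, value) for plotting
cor_matrix <- cor(ames_numeric, use = "pairwise.complete.obs")
cor_melted <- melt(cor_matrix)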
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white", linewidth = 0.2) +
geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3, vjust = 1) +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, limit = c(-1, 1), name="Correlation") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
axis.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "gray95"),
plot.background = element_rect(color = "gray95", fill = "gray95")
) +
labs(title = "Correlation Matrix of Housing Features", subtitle = "Numeric features of the Ames Housing dataset")
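A correlation matrix of this size can also be scanned programmatically for pairs of predictors that are strongly correlated with each other, which matters later for regression models. The sketch below is one way to do that on a hypothetical `toy` data frame. The 0.8 cutoff is an arbitrary assumption chosen for illustration.

```r
# Toy numeric data; GarageCars and GarageArea are expected to move together
toy <- data.frame(
  GarageCars = c(1, 2, 2, 3, 3, 2),
  GarageArea = c(205, 460, 548, 642, 836, 480),
  MoSold     = c(1, 5, 2, 2, 12, 10)
)

cm <- cor(toy)

# Blank out the diagonal and lower triangle so each pair is considered once
cm[lower.tri(cm, diag = TRUE)] <- NA

# Positions of the remaining entries whose absolute correlation exceeds the cutoff
high <- which(abs(cm) > 0.8, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high],
  row.names = NULL
)
print(high_pairs)
```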
The next step is to identify the 10 variables most highly correlated with the target variable SalePrice. OverallQual, the overall quality rating, is among the variables most highly correlated with SalePrice.
sale_price_correlations <- cor_matrix[,"SalePrice", drop = FALSE]
sorted_correlations <- sort(sale_price_correlations[,1], decreasing = TRUE)
top_correlations <- head(sorted_correlations[-1], 10)
cor_data <- data.frame(
Variable = names(top_correlations),
Correlation = top_correlations
)
# Single-row heatmap; tiles are ordered by correlation strength rather than alphabetically
ggplot(cor_data, aes(x = reorder(Variable, -Correlation), y = "SalePrice", fill = Correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f", Correlation)), color = "black", size = 5, vjust = 0.5) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name="Correlation") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    plot.background = element_rect(fill = "white")
  ) +
  labs(title = "Top 10 Variables Correlated with SalePrice", x = "Variables", y = "")
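Sorting correlations in decreasing order surfaces only positive relationships, so a strongly negative predictor would land at the bottom of the list. Ranking by absolute value avoids this. The sketch below uses a hypothetical `toy` data frame; the `HouseAge` column is invented for illustration and is not a column of the dataset.

```r
# Toy data: one strongly positive, one strongly negative, one weak predictor
toy <- data.frame(
  SalePrice = c(100, 150, 200, 250, 300),
  GrLivArea = c(900, 1200, 1500, 2100, 2400),
  HouseAge  = c(60, 45, 30, 12, 5),
  MoSold    = c(6, 2, 9, 4, 7)
)

# Correlation of every predictor with SalePrice, dropping SalePrice itself
cors <- cor(toy)[, "SalePrice"]
cors <- cors[names(cors) != "SalePrice"]

# Rank by strength regardless of sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

With a signed sort, HouseAge would appear last despite being one of the strongest predictors in this toy example.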
The histogram with a density overlay shows the distribution of sale prices in the Ames housing dataset. Values are concentrated in the lower price range, with a long tail extending towards higher prices, so relatively few houses sell at the top of the range. The mean sale price is approximately $180,921, above the median of $163,000 reported by summary. This gap is characteristic of a right-skewed distribution, in which a small number of very expensive houses pull the mean upwards more than the median or mode.
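ggplot(ames_housing, aes(x = SalePrice)) +
  # A fixed bin width lets the density curve be rescaled to the histogram's count axis
  geom_histogram(binwidth = 25000, fill = "blue", color = "black", alpha = 0.7) +
  # after_stat(count) is density * n; multiplying by the bin width matches the bar heights
  geom_density(aes(y = after_stat(count * 25000), color = "Density"), fill = "lightblue", alpha = 0.3) +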
labs(title = "Distribution of Sale Prices", x = "Sale Price", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "bottom"
) +
scale_color_manual(values = c("Density" = "red")) +
guides(color = guide_legend(title = "Overlay"))
mean_sale_price <- mean(ames_housing$SalePrice)
print(mean_sale_price)
## [1] 180921.2
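Right skew can be quantified as well as seen. The sketch below compares the mean and median of a small price vector and computes a simple moment-based skewness before and after a log transform. The `prices` vector is illustrative (a few values echo the summary statistics above), and the `skewness` helper is a hypothetical function defined here, not part of base R.

```r
# Illustrative prices with one very expensive house
prices <- c(120000, 135000, 150000, 163000, 180000, 214000, 755000)

# Moment-based skewness: third central moment over the cubed standard deviation
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

mean_price   <- mean(prices)
median_price <- median(prices)

skew_raw <- skewness(prices)
skew_log <- skewness(log(prices))

# A mean above the median and a positive skewness both indicate a right tail
print(c(mean = mean_price, median = median_price))
print(c(raw = skew_raw, log = skew_log))
```

A log transform of the target is a common way to reduce this kind of skew before fitting a linear model, though whether it helps here would need to be checked on the actual data.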
The boxplot with jittered data points displays the distribution of sale prices in the Ames housing dataset. It shows the interquartile range, the median line, and the outliers, which here are homes with significantly higher prices. The jittered points show how densely sales cluster in each price range, making the variability in prices visible and pointing to outliers that may reflect unique features or desirable locations.
ggplot(ames_housing, aes(y = SalePrice)) +
geom_boxplot(fill = "lightblue", color = "black", alpha = 0.7, outlier.shape = 16, outlier.size = 2) +
geom_jitter(aes(x = 1), color = "blue", alpha = 0.3, width = 0.1) +
labs(y = "Sale Price") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title.y = element_text(size = 14)
) +
ggtitle("Distribution of Sale Price")
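The points drawn as outliers in a boxplot follow a simple rule: by default, geom_boxplot flags values lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below reproduces that rule by hand on an illustrative `prices` vector (not the real data) so the flagged sales can be listed and inspected.

```r
# Illustrative prices; the extremes echo the minimum and maximum in the summary output
prices <- c(34900, 129975, 163000, 180921, 214000, 340000, 755000)

# Quartiles and interquartile range
q   <- unname(quantile(prices, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are the boxplot's outliers
outliers <- prices[prices < lower_fence | prices > upper_fence]
print(outliers)
```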
Correlation Matrix for Physical Features of the House: Overall Quality, Overall Condition, and Year Built
ames_housing %>%
select(OverallQual, OverallCond, YearBuilt, RoofStyle, Exterior1st, Exterior2nd) %>%
summary()
##   OverallQual      OverallCond      YearBuilt     RoofStyle
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Length:1460
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   Class :character
##  Median : 6.000   Median :5.000   Median :1973   Mode  :character
##  Mean   : 6.099   Mean   :5.575   Mean   :1971
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000
##  Max.   :10.000   Max.   :9.000   Max.   :2010
##  Exterior1st        Exterior2nd
##  Length:1460        Length:1460
##  Class :character   Class :character
##  Mode  :character   Mode  :character
##
##
##
ggplot(ames_housing, aes(x = OverallQual)) +
geom_histogram(binwidth = 1, fill = "blue") +
labs(title = "Distribution of Overall Quality Ratings")
physical_features <- ames_housing %>%
select(OverallQual, OverallCond, YearBuilt)
cor_physical <- cor(physical_features, use = "complete.obs")
ggplot(melt(cor_physical), aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(midpoint = 0, low = "blue", high = "red", mid = "white") +
theme_minimal() +
labs(title = "Correlation Matrix for Physical House Features")
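The heatmap code relies on `melt()` to turn the square correlation matrix into one row per pair of variables, presumably from the reshape2 package loaded earlier in the report (an assumption, since the loading step is not shown here). As a sketch, the same long layout can be produced with base R alone. The toy data frame below is made up for illustration.

```r
# Toy numeric data for illustration only (not the Ames data).
toy <- data.frame(
  OverallQual = c(5, 6, 7, 8, 4, 9),
  OverallCond = c(5, 5, 6, 5, 7, 5),
  YearBuilt   = c(1960, 1975, 1995, 2004, 1940, 2008)
)

cor_toy <- cor(toy, use = "complete.obs")

# as.table() followed by as.data.frame() yields one row per (Var1, Var2) pair,
# which is the long layout a tile plot needs.
cor_long <- as.data.frame(as.table(cor_toy), responseName = "value")

print(cor_long)
```

The resulting columns are named `Var1`, `Var2`, and `value`, so they would match the aesthetics used in the `geom_tile()` call above.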
Exploratory Data Analysis (EDA)
How do external features such as proximity and lot area influence the sale price in Ames, Iowa?
In the housing market of Ames, Iowa, external factors such as lot area and proximity to roads, railroads, and positive off-site features (recorded in Condition1 and Condition2) are associated with differences in sale prices. Larger lots generally command higher prices, though other factors play crucial roles. Properties near negative externalities exhibit lower prices and greater variability, while those in desirable locales command higher prices. The interaction between house style and proximity further underscores these dynamics, revealing the nuanced impact of external features on real estate values.
ggplot(ames_housing, aes(x = SalePrice)) +
geom_histogram(bins = 50, fill = "cornflowerblue", color = "black") +
labs(title = "Distribution of Sale Prices",
x = "Sale Price",
y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Comparing Lot Area with Sale Price
The scatter plot shows lot area against sale price in the Ames dataset. Blue points represent properties, with transparency indicating data concentration. A red regression line suggests a positive trend, but wide point spread implies weak correlation. Light blue shading around the line represents the 95% confidence interval, highlighting considerable variability beyond lot area’s influence.
ggplot(ames_housing, aes(x = LotArea, y = SalePrice)) +
geom_point(alpha = 0.6, color = "blue") +
geom_smooth(method = "lm", color = "red", se = TRUE, fill = "lightblue", alpha = 0.2) +
labs(title = "Sale Price vs. Lot Area",
x = "Lot Area (sq feet)",
y = "Sale Price") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
## `geom_smooth()` using formula = 'y ~ x'
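The visual impression of a weak relationship can be backed by a number. The sketch below computes the Pearson correlation and the R-squared of the same kind of straight-line fit that `geom_smooth(method = "lm")` draws. The values in `toy` are invented for illustration. On the real data the columns would come from `ames_housing`.

```r
# Toy lot areas and prices for illustration only (not the Ames data).
toy <- data.frame(
  LotArea   = c(5000, 7200, 8400, 9600, 11000, 12500, 15000, 21000),
  SalePrice = c(120000, 155000, 140000, 180000, 165000, 230000, 175000, 260000)
)

# Pearson correlation between lot area and sale price.
r <- cor(toy$LotArea, toy$SalePrice)

# Simple linear regression, the same model as the red line in the scatter plot.
fit <- lm(SalePrice ~ LotArea, data = toy)
r_squared <- summary(fit)$r.squared

print(c(r = r, r_squared = r_squared))
```

For a regression with one predictor, R-squared equals the squared correlation, so it reads directly as the share of price variance that lot area alone accounts for.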
Density of Sale Prices by Major Roadway Proximity (Condition1)
The plot illustrates sale price density in the Ames dataset, segmented by proximity to major roadways (Condition1). Each color represents a different condition, showcasing sale price distributions. Properties near undesirable features like major roads or railroads exhibit varied pricing, possibly indicating lower values due to negative factors. Conversely, normal or positively noted areas show narrower and higher-peaked distributions, reflecting higher median prices and less variance. This visual insight elucidates the impact of roadway proximity on property values.
ggplot(ames_housing, aes(x = SalePrice, fill = Condition1)) +
geom_density(alpha = 0.7) +
labs(title = "Density of Sale Prices by Major Roadway Proximity (Condition1)",
x = "Sale Price",
y = "Density") +
scale_fill_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
legend.position = "bottom",
legend.title = element_blank(),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Sale Prices by Condition1 and Condition2
The plot illustrates sale prices in the Ames dataset, categorized by Condition1 and Condition2, which record proximity to features such as arterial streets, feeder streets, railroads, and parks. Each panel corresponds to one Condition2 category, and within a panel the boxplots show the median, quartiles, and range of sale prices for each Condition1 category (for example ‘Artery’, ‘Feedr’, ‘Norm’). The ‘PosN’ category under Condition1 exhibits higher median prices and less variability compared to ‘Artery’ or ‘Feedr’, suggesting that proximity to major roadways is associated with lower prices. This visualization shows how environmental conditions interact in their relationship with sale prices.
colors <- brewer.pal(9, "Set1")
ggplot(ames_housing, aes(x = Condition1, y = SalePrice, fill = Condition1)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
facet_wrap(~ Condition2, scales = "free_x", nrow = 2, labeller = label_both) +
labs(title = "Sale Prices by Condition1 and Condition2",
x = "Condition1",
y = "Sale Price") +
scale_fill_manual(values = colors) +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
strip.text = element_text(size = 12, face = "bold")
)
Boxplot of Sale Prices by Proximity to Major Roadways (Condition1)
The boxplot illustrates sale price distribution relative to proximity to major roadways and railroads (Condition1) in the Ames housing dataset. Categories like ‘Artery’ and ‘Feedr’ (adjacent to arterial and feeder streets) tend to have lower median sale prices and broader ranges, indicating variability in buyer valuation. Conversely, ‘PosN’ (near a positive off-site feature such as a park or greenbelt) and ‘RRNn’ (within 200 feet of a north-south railroad) exhibit higher median prices and narrower interquartile ranges, suggesting higher buyer valuation. Outliers, especially in ‘Norm’ and ‘PosN’, may indicate unique features or better conditions. This visualization highlights how environmental factors relate to real estate values, with less trafficked areas generally commanding higher prices.
ggplot(ames_housing, aes(x = Condition1, y = SalePrice)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3, fill = "lightblue", color = "black", alpha = 0.7) +
labs(title = "Sale Prices by Proximity to Major Roadways (Condition1)",
x = "Condition",
y = "Sale Price") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
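Boxplots hide how many sales sit behind each box, and some Condition1 categories may contain only a handful of homes. A small per-group table of counts, medians, and interquartile ranges makes that visible. The sketch below uses base R and a made-up data frame. The category labels follow the Condition1 codes, but the prices are invented.

```r
# Toy data for illustration only (not the Ames data).
toy <- data.frame(
  Condition1 = c("Norm", "Norm", "Norm", "Norm", "Feedr", "Feedr", "Artery", "PosN"),
  SalePrice  = c(150000, 180000, 210000, 165000, 120000, 140000, 110000, 250000)
)

# One row per category: number of sales, median price, and interquartile range.
group_summary <- do.call(rbind, lapply(split(toy$SalePrice, toy$Condition1), function(p) {
  data.frame(n = length(p), median = median(p), iqr = IQR(p))
}))

# Sort from the highest to the lowest median price.
group_summary <- group_summary[order(-group_summary$median), ]

print(group_summary)
```

Categories with a very small `n` deserve caution: a high or low median there may reflect one or two individual homes rather than the effect of the location.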
Interaction of House Style, Condition 1 and Sale Price of House
The plot demonstrates the interaction between house style and proximity to conditions (Condition1) on sale prices in the Ames dataset. Trend lines for different styles show varying price changes across conditions. For example, ‘2.5Fin’ (two and one-half story, finished) homes generally decrease in price near major arterials but hold higher values in favorable conditions like ‘PosN’, indicating the impact of location on real estate valuation. Because Condition1 is categorical, the fitted lines depend on the alphabetical ordering of its levels and should be read as a rough visual guide rather than as a true trend.
ggplot(ames_housing, aes(x = Condition1, y = SalePrice, color = HouseStyle)) +
geom_point(alpha = 0.6) +
geom_smooth(aes(group = HouseStyle), method = "lm", se = FALSE) +
labs(title = "Interaction of House Style and Condition1 on Sale Prices",
x = "Condition1",
y = "Sale Price") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
## `geom_smooth()` using formula = 'y ~ x'
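Since both variables in this plot are categorical, a two-way table of medians is one way to read the interaction without relying on fitted lines. The sketch below is a minimal illustration with made-up data. The style and condition labels mimic the dataset's codes, but the prices are invented.

```r
# Toy data for illustration only (not the Ames data).
toy <- data.frame(
  HouseStyle = c("1Story", "1Story", "2Story", "2Story", "1Story", "2Story"),
  Condition1 = c("Norm", "Artery", "Norm", "Artery", "Norm", "Norm"),
  SalePrice  = c(150000, 120000, 200000, 160000, 170000, 220000)
)

# Median sale price for every HouseStyle x Condition1 combination.
median_by_cell <- tapply(toy$SalePrice, list(toy$HouseStyle, toy$Condition1), median)

# Number of sales behind each median, to show which cells are sparse.
count_by_cell <- table(toy$HouseStyle, toy$Condition1)

print(median_by_cell)
print(count_by_cell)
```

An interaction shows up when the price gap between conditions differs from one house style to another. The count table indicates which of those gaps rest on enough sales to be trusted.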
What effects do renovations have on the Sale Price of a house in Ames, Iowa?
The summary statistics and test below do not show a simple price premium for renovated homes in Ames, Iowa. Taken as a group, renovated homes have a slightly lower mean sale price ($179,096 versus $182,584) and a lower median ($155,000 versus $170,000) than non-renovated homes, with a larger standard deviation, and a Welch t-test finds no significant difference in means (p = 0.41). This raw comparison is likely confounded by age, since homes that have been remodeled are probably older on average than homes that have not. Once age and location are taken into account the picture changes: the scatter plot of price against age suggests that renovated homes hold their value better as they age, and the boxplots by neighborhood indicate that renovated homes command higher median prices in most neighborhoods, with notable premiums in desirable areas like StoneBr, NridgHt, and NoRidge. The wider spread of sale prices among renovated homes points to differences in renovation extent and quality. Overall, these analyses suggest that the value of a renovation depends on the age and location of the home rather than being a uniform boost to price.
ames_housing$Renovated <- ifelse(ames_housing$YearRemodAdd > ames_housing$YearBuilt, "Renovated", "Not Renovated")
renovation_summary <- ames_housing %>%
group_by(Renovated) %>%
summarise(
Count = n(),
Mean = mean(SalePrice, na.rm = TRUE),
Median = median(SalePrice, na.rm = TRUE),
SD = sd(SalePrice, na.rm = TRUE)
)
print(renovation_summary)
## # A tibble: 2 × 5
##   Renovated     Count    Mean Median     SD
##   <chr>         <int>   <dbl>  <dbl>  <dbl>
## 1 Not Renovated   764 182584. 170000 70334.
## 2 Renovated       696 179096. 155000 88383.
Density Plot of Sale Price by Renovation Status of House
This density plot compares sale price distributions between renovated and non-renovated homes in Ames. The renovated group has the lower median ($155,000 versus $170,000) and the larger standard deviation ($88,383 versus $70,334), so its distribution is more spread out rather than shifted toward higher prices. The Welch t-test that follows the plot finds no significant difference in mean sale price between the two groups (p = 0.41).
ggplot(ames_housing, aes(x = SalePrice, fill = Renovated)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Sale Prices by Renovation Status",
x = "Sale Price",
y = "Density",
fill = "Renovated") +
scale_fill_manual(values = c("lightblue", "orange")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
t_test_result <- t.test(SalePrice ~ Renovated, data = ames_housing)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: SalePrice by Renovated
## t = 0.82895, df = 1326.2, p-value = 0.4073
## alternative hypothesis: true difference in means between group Not Renovated and group Renovated is not equal to 0
## 95 percent confidence interval:
## -4765.654 11740.359
## sample estimates:
## mean in group Not Renovated mean in group Renovated
## 182583.7 179096.3
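Because sale prices are right-skewed, a t-test on raw prices can be sensitive to a few expensive homes. As a robustness check, the comparison can be repeated on log prices and with a rank-based test. The sketch below runs on simulated data generated only for illustration (the group parameters are arbitrary assumptions), so its p-values say nothing about Ames.

```r
set.seed(1)

# Simulated right-skewed prices for illustration only (not the Ames data).
toy <- data.frame(
  Renovated = rep(c("Not Renovated", "Renovated"), each = 50),
  SalePrice = c(rlnorm(50, meanlog = log(170000), sdlog = 0.35),
                rlnorm(50, meanlog = log(160000), sdlog = 0.45))
)

# Welch t-test on log prices: compares geometric rather than arithmetic means.
log_test <- t.test(log(SalePrice) ~ Renovated, data = toy)

# Wilcoxon rank-sum test: makes no normality assumption.
rank_test <- wilcox.test(SalePrice ~ Renovated, data = toy)

print(log_test$p.value)
print(rank_test$p.value)
```

If all three tests agree, the conclusion does not hinge on the skew. If they disagree, the difference is probably driven by the tail of the distribution rather than by typical homes.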
Sale Price by Age of House at Sale and Renovation Status
The scatter plot illustrates the relationship between home age at sale and sale prices, categorized by renovation status. Blue dots represent non-renovated homes, while red dots depict renovated ones. Both show a decline in sale prices as home age increases, but the slope is steeper for non-renovated homes, indicating greater depreciation with age. Renovated homes maintain higher prices, especially for older properties, suggesting renovations mitigate age-related value declines. Variability in sale prices among renovated homes reflects differences in renovation extent and effectiveness in enhancing property value.
ames_housing$AgeAtSale <- ames_housing$YrSold - ames_housing$YearBuilt
ggplot(ames_housing, aes(x = AgeAtSale, y = SalePrice, color = Renovated)) +
geom_point(alpha = 0.6, size = 3) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1, aes(group = Renovated)) +
labs(title = "Sale Price vs. Age at Sale by Renovation Status",
x = "Age at Sale (Years)",
y = "Sale Price") +
scale_color_manual(values = c("Not Renovated" = "blue", "Renovated" = "red")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
## `geom_smooth()` using formula = 'y ~ x'
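The claim that the two groups depreciate at different rates corresponds to an interaction term in a linear model. The sketch below fits that model on toy data constructed so the true slopes are known. The numbers are invented for illustration and are not estimates from the Ames data.

```r
# Toy data constructed so the true slopes are known (not the Ames data):
# non-renovated homes lose 1500 per year of age, renovated homes lose 800.
age <- c(5, 20, 40, 60, 80)
toy <- data.frame(
  AgeAtSale = rep(age, times = 2),
  Renovated = rep(c("Not Renovated", "Renovated"), each = length(age)),
  SalePrice = c(250000 - 1500 * age, 240000 - 800 * age)
)

# AgeAtSale * Renovated expands to both main effects plus their interaction.
fit <- lm(SalePrice ~ AgeAtSale * Renovated, data = toy)
coefs <- coef(fit)

# The baseline slope belongs to "Not Renovated"; the interaction term is the
# difference in slope for the "Renovated" group.
slope_not_renovated <- coefs[["AgeAtSale"]]
slope_renovated     <- coefs[["AgeAtSale"]] + coefs[["AgeAtSale:RenovatedRenovated"]]

print(c(not_renovated = slope_not_renovated, renovated = slope_renovated))
```

On the real data, the sign and standard error of the interaction coefficient would indicate whether the difference in slopes seen in the scatter plot is larger than what sampling noise could produce.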
Sale Price of House by Neighborhood and Renovation Status
The boxplot visualizes sale prices of homes by neighborhood and renovation status in the Ames dataset. For each neighborhood, two boxplots are displayed together, orange for renovated homes and blue for non-renovated. Within a neighborhood, renovated homes generally exhibit higher median sale prices across almost all neighborhoods, notably in areas like StoneBr, NridgHt, and NoRidge, indicating a premium for upgrades in these locales. Additionally, the broader range of prices for renovated homes in several neighborhoods reflects varying impacts of renovations on property values, emphasizing the influence of neighborhood context on the return on investment in home upgrades.
ggplot(ames_housing, aes(x = Neighborhood, y = SalePrice, fill = Renovated)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
coord_flip() +
labs(title = "Sale Price by Neighborhood and Renovation Status",
x = "Neighborhood",
y = "Sale Price",
fill = "Renovated") +
scale_fill_manual(values = c("Not Renovated" = "lightblue", "Renovated" = "orange")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Impact of Renovations on Sale Price of the House
This boxplot compares sale prices between renovated (orange) and non-renovated (blue) homes, with a red diamond marking each group's summary value. Consistent with the summary table, the median for renovated homes ($155,000) is lower than for non-renovated homes ($170,000), so the plot does not show a price premium for renovated homes as a group. The broader interquartile range in the renovated group suggests greater variability, likely due to differences in renovation quality. The code below then computes the correlation matrix of the numeric variables for renovated homes only.
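The full correlation matrix printed below is large. When the question is which features move with sale price, a ranked list of correlations with `SalePrice` is often easier to read. The sketch below shows one way to extract it. `rank_correlations()` is a hypothetical helper and the data frame is made up for illustration.

```r
# Hypothetical helper: correlations of every numeric column with a target column,
# sorted by absolute strength.
rank_correlations <- function(df, target = "SalePrice") {
  numeric_df <- df[vapply(df, is.numeric, logical(1))]   # drop non-numeric columns
  cors <- cor(numeric_df, use = "complete.obs")[, target]
  cors <- cors[names(cors) != target]                    # drop the target's self-correlation
  cors[order(abs(cors), decreasing = TRUE)]
}

# Toy data for illustration only (not the Ames data).
toy <- data.frame(
  SalePrice   = c(120000, 150000, 180000, 210000, 260000, 300000),
  GrLivArea   = c(900, 1100, 1400, 1500, 2000, 2300),
  OverallCond = c(6, 5, 7, 5, 5, 6),
  Style       = c("a", "b", "a", "b", "a", "b")
)

ranked <- rank_correlations(toy)
print(round(ranked, 3))
```

Called as `rank_correlations(renovated_data)`, the helper would return the `SalePrice` column of the matrix below in sorted form.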
renovated_data <- ames_housing %>% filter(Renovated == "Renovated")
cor_matrix_renovated <- cor(renovated_data[which(sapply(renovated_data, is.numeric))], use = "complete.obs")
print(cor_matrix_renovated)
## Id MSSubClass LotArea OverallQual OverallCond
## Id 1.0000000000 -0.006639610 -0.022016621 -0.060289631 0.037909806
## MSSubClass -0.0066396104 1.000000000 -0.125602271 0.062517468 -0.032679718
## LotArea -0.0220166213 -0.125602271 1.000000000 0.167084313 -0.009618699
## OverallQual -0.0602896311 0.062517468 0.167084313 1.000000000 -0.053667383
## OverallCond 0.0379098063 -0.032679718 -0.009618699 -0.053667383 1.000000000
## YearBuilt -0.0399568478 -0.073834531 0.075330772 0.547681042 -0.319710604
## YearRemodAdd -0.0640346042 -0.028065272 0.095042539 0.425911931 0.173148727
## MasVnrArea -0.0488956115 -0.009192766 0.207099473 0.479914944 -0.145184992
## BsmtFinSF1 -0.0391894319 -0.108023622 0.233642972 0.356704165 -0.057463383
## BsmtFinSF2 -0.0178375056 -0.105310750 0.060086317 -0.028052039 0.043679262
## BsmtUnfSF 0.0081393797 -0.037705412 -0.009724210 0.241439802 -0.164626963
## TotalBsmtSF -0.0396371046 -0.188452316 0.255173397 0.586746369 -0.198915206
## X1stFlrSF 0.0203323106 -0.176311812 0.334290895 0.535718231 -0.164060971
## X2ndFlrSF -0.0007198465 0.380722483 0.090442641 0.296775715 0.035730682
## LowQualFinSF -0.0671622956 0.073312677 0.011154863 -0.031385404 -0.012965561
## GrLivArea 0.0054387410 0.180230372 0.306014218 0.601955841 -0.089043685
## BsmtFullBath -0.0248979455 -0.064913692 0.128148279 0.137280006 -0.079082676
## BsmtHalfBath -0.0278538562 0.010634871 0.083906847 -0.008352248 0.147819316
## FullBath -0.0083031047 0.199835903 0.167773990 0.537519876 -0.187917809
## HalfBath -0.0175459828 0.131287802 0.052171698 0.301497106 -0.033796140
## BedroomAbvGr 0.0296183799 0.145950710 0.128185591 0.202243762 0.030732949
## KitchenAbvGr -0.0234018748 0.441858582 -0.031203353 -0.136340432 -0.066524785
## TotRmsAbvGrd 0.0050853618 0.199538860 0.197419943 0.455275344 -0.074249810
## Fireplaces 0.0084739046 -0.025726795 0.303203671 0.445477486 -0.046408969
## GarageYrBlt 0.0043305595 -0.036827986 0.063282117 0.502061406 -0.236380028
## GarageCars -0.0001871438 -0.030437865 0.224106130 0.608216689 -0.169377168
## GarageArea 0.0103609659 -0.074054376 0.228741545 0.556039810 -0.135474973
## WoodDeckSF -0.0457744475 -0.036648791 0.204054218 0.269674224 -0.012192706
## OpenPorchSF -0.0120882679 -0.006616687 0.110835032 0.261312933 -0.024727444
## EnclosedPorch 0.0061911531 0.047757220 -0.036208670 -0.143470852 0.031693529
## X3SsnPorch -0.0447543072 -0.064113377 0.041591085 -0.001150516 0.065915198
## ScreenPorch 0.0101420678 0.026662815 0.028175899 0.075581831 0.087323870
## PoolArea -0.0237825066 -0.013957198 0.037804421 0.032688407 -0.032863505
## MiscVal -0.0330447108 -0.017313856 0.024315317 -0.018628474 0.071427779
## MoSold 0.0209367886 -0.022297585 -0.002003056 0.056159875 0.032781713
## YrSold 0.0413814441 -0.034270415 -0.044487296 -0.046477145 0.075629822
## SalePrice -0.0351673374 -0.044642350 0.311834688 0.791809766 -0.076108324
## AgeAtSale 0.0415002160 0.072510112 -0.076975968 -0.549196907 0.322424570
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## Id -0.0399568478 -0.0640346042 -0.048895611 -0.039189432
## MSSubClass -0.0738345309 -0.0280652715 -0.009192766 -0.108023622
## LotArea 0.0753307718 0.0950425391 0.207099473 0.233642972
## OverallQual 0.5476810417 0.4259119312 0.479914944 0.356704165
## OverallCond -0.3197106041 0.1731487275 -0.145184992 -0.057463383
## YearBuilt 1.0000000000 0.5579918455 0.417227574 0.392280191
## YearRemodAdd 0.5579918455 1.0000000000 0.240193970 0.277006123
## MasVnrArea 0.4172275739 0.2401939702 1.000000000 0.375127090
## BsmtFinSF1 0.3922801914 0.2770061229 0.375127090 1.000000000
## BsmtFinSF2 0.0102314962 0.0413587001 -0.059037727 -0.048888982
## BsmtUnfSF 0.1048564929 0.0456373270 0.066218839 -0.468564107
## TotalBsmtSF 0.5083688997 0.3452779580 0.426826173 0.566093088
## X1stFlrSF 0.4113056771 0.3399610634 0.400193402 0.504982181
## X2ndFlrSF -0.0933264713 -0.0389362029 0.229647226 -0.128738855
## LowQualFinSF -0.1749403311 -0.1028627429 -0.079261050 -0.073077025
## GrLivArea 0.1947836187 0.1957290692 0.448958254 0.245631540
## BsmtFullBath 0.2788631184 0.2094954274 0.136274807 0.649662154
## BsmtHalfBath 0.0195111209 0.0589708784 0.037829061 0.079159252
## FullBath 0.4376355807 0.3232159118 0.354460770 0.168849004
## HalfBath 0.2508164621 0.1716973924 0.260020168 0.049209772
## BedroomAbvGr -0.0694496337 0.0003271869 0.109613808 -0.094221309
## KitchenAbvGr -0.2109171381 -0.1303134932 -0.036974643 -0.061503305
## TotRmsAbvGrd 0.0805454600 0.1307529209 0.309220169 0.073172683
## Fireplaces 0.2428563179 0.1104378714 0.268966858 0.285069181
## GarageYrBlt 0.7808115142 0.5122964118 0.355357650 0.294241975
## GarageCars 0.5472277342 0.3703407757 0.434128981 0.333575437
## GarageArea 0.4898972075 0.3392711943 0.441508799 0.375789219
## WoodDeckSF 0.2802516778 0.2469665673 0.233667116 0.239887490
## OpenPorchSF 0.1484619547 0.1472928602 0.099611110 0.145155194
## EnclosedPorch -0.4225391910 -0.2415215831 -0.158127597 -0.143636712
## X3SsnPorch 0.0525336628 0.0692085136 0.001692652 0.021099719
## ScreenPorch -0.0238284556 0.0120623637 0.038251562 0.035624870
## PoolArea -0.0136803012 0.0200061646 -0.008463973 0.051913386
## MiscVal -0.0167790183 0.0049087618 -0.025889577 0.004564695
## MoSold -0.0001492084 0.0222041172 -0.015331347 -0.021723756
## YrSold 0.0073899790 0.0931027149 -0.016194832 0.045788190
## SalePrice 0.5613064396 0.4457476914 0.592684037 0.500689216
## AgeAtSale -0.9992886157 -0.5542391451 -0.417657819 -0.390383608
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF
## Id -0.017837506 0.0081393797 -0.039637105 0.020332311
## MSSubClass -0.105310750 -0.0377054121 -0.188452316 -0.176311812
## LotArea 0.060086317 -0.0097242103 0.255173397 0.334290895
## OverallQual -0.028052039 0.2414398019 0.586746369 0.535718231
## OverallCond 0.043679262 -0.1646269635 -0.198915206 -0.164060971
## YearBuilt 0.010231496 0.1048564929 0.508368900 0.411305677
## YearRemodAdd 0.041358700 0.0456373270 0.345277958 0.339961063
## MasVnrArea -0.059037727 0.0662188393 0.426826173 0.400193402
## BsmtFinSF1 -0.048888982 -0.4685641072 0.566093088 0.504982181
## BsmtFinSF2 1.000000000 -0.2191904765 0.131710707 0.123333770
## BsmtUnfSF -0.219190477 1.0000000000 0.383122796 0.259588648
## TotalBsmtSF 0.131710707 0.3831227958 1.000000000 0.816100076
## X1stFlrSF 0.123333770 0.2595886484 0.816100076 1.000000000
## X2ndFlrSF -0.104029730 0.0529425716 -0.123017972 -0.110154595
## LowQualFinSF 0.024496444 0.0386635311 -0.028981451 -0.008517507
## GrLivArea 0.008977191 0.2279325569 0.473812433 0.615344047
## BsmtFullBath 0.184316599 -0.4020311866 0.359275189 0.304219937
## BsmtHalfBath 0.022968436 -0.0995440528 -0.004155133 -0.028073202
## FullBath -0.060610594 0.2539540931 0.392206552 0.459400155
## HalfBath -0.020889658 -0.0030778592 0.039654873 0.007522238
## BedroomAbvGr -0.043715442 0.1937302335 0.070179295 0.144486744
## KitchenAbvGr -0.048287367 0.0189400828 -0.064266234 -0.002570116
## TotRmsAbvGrd -0.051956262 0.2527588324 0.295782531 0.440271716
## Fireplaces 0.058216119 0.0561520129 0.370190960 0.460004164
## GarageYrBlt -0.001802506 0.1290552646 0.425599256 0.383168193
## GarageCars -0.003473848 0.1860054622 0.519726067 0.522270304
## GarageArea 0.007852866 0.1455919554 0.529212774 0.543036385
## WoodDeckSF 0.101763659 -0.0207645230 0.267390229 0.306111531
## OpenPorchSF 0.013583386 0.0967038544 0.247064638 0.221026630
## EnclosedPorch -0.032742899 0.0177872719 -0.143990779 -0.131021455
## X3SsnPorch -0.027151777 0.0002356145 0.011372989 0.052785139
## ScreenPorch 0.065055835 0.0177230130 0.079034840 0.093181436
## PoolArea 0.078378261 -0.0673267824 0.020071808 0.023841599
## MiscVal -0.017553525 -0.0287515737 -0.029523891 -0.037508446
## MoSold -0.059112690 0.0337958583 -0.013330567 0.001220686
## YrSold 0.044933717 -0.0838250855 -0.015021335 -0.029749520
## SalePrice 0.012493079 0.1805041689 0.693068292 0.685510964
## AgeAtSale -0.008532442 -0.1079724968 -0.508715453 -0.412249685
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Id -0.0007198465 -0.067162296 0.005438741 -0.024897946
## MSSubClass 0.3807224830 0.073312677 0.180230372 -0.064913692
## LotArea 0.0904426411 0.011154863 0.306014218 0.128148279
## OverallQual 0.2967757148 -0.031385404 0.601955841 0.137280006
## OverallCond 0.0357306819 -0.012965561 -0.089043685 -0.079082676
## YearBuilt -0.0933264713 -0.174940331 0.194783619 0.278863118
## YearRemodAdd -0.0389362029 -0.102862743 0.195729069 0.209495427
## MasVnrArea 0.2296472264 -0.079261050 0.448958254 0.136274807
## BsmtFinSF1 -0.1287388550 -0.073077025 0.245631540 0.649662154
## BsmtFinSF2 -0.1040297297 0.024496444 0.008977191 0.184316599
## BsmtUnfSF 0.0529425716 0.038663531 0.227932557 -0.402031187
## TotalBsmtSF -0.1230179723 -0.028981451 0.473812433 0.359275189
## X1stFlrSF -0.1101545952 -0.008517507 0.615344047 0.304219937
## X2ndFlrSF 1.0000000000 0.071149984 0.706106697 -0.189675239
## LowQualFinSF 0.0711499844 1.000000000 0.172292691 -0.054789653
## GrLivArea 0.7061066972 0.172292691 1.000000000 0.059800225
## BsmtFullBath -0.1896752391 -0.054789653 0.059800225 1.000000000
## BsmtHalfBath -0.0246742601 -0.013497582 -0.040475630 -0.136195635
## FullBath 0.4117367621 0.008561864 0.642375016 0.019440640
## HalfBath 0.5373903329 -0.030175941 0.417847566 -0.011698472
## BedroomAbvGr 0.5785577967 0.151401807 0.568210825 -0.099019938
## KitchenAbvGr 0.1406924830 0.011127107 0.108553193 -0.049122088
## TotRmsAbvGrd 0.6527557029 0.179144977 0.836663589 -0.035065599
## Fireplaces 0.2046079923 -0.035077316 0.476973811 0.123015140
## GarageYrBlt -0.0302644656 -0.089984746 0.234351994 0.201751509
## GarageCars 0.1473055987 -0.104511981 0.467727413 0.165608043
## GarageArea 0.1067700285 -0.068292563 0.455359450 0.202026100
## WoodDeckSF 0.0459586009 -0.026545423 0.247137941 0.168518634
## OpenPorchSF 0.1793530148 0.017376421 0.296173980 0.066650395
## EnclosedPorch 0.0955708126 0.050490384 -0.011689540 -0.110295667
## X3SsnPorch -0.0312375867 -0.007563255 0.011911167 0.003572285
## ScreenPorch 0.0689403622 0.038248170 0.123502130 -0.017347176
## PoolArea 0.0251401179 0.123236596 0.051381868 0.063381146
## MiscVal -0.0241323959 -0.008031946 -0.046003696 -0.033475603
## MoSold 0.0578180614 -0.029448701 0.042018183 -0.064977973
## YrSold -0.0751744197 -0.040035047 -0.084039695 0.112956054
## SalePrice 0.3034291010 -0.028278444 0.712605540 0.259817212
## AgeAtSale 0.0904509684 0.173354759 -0.197868807 -0.274482444
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Id -0.0278538562 -0.008303105 -0.017545983 0.0296183799
## MSSubClass 0.0106348715 0.199835903 0.131287802 0.1459507099
## LotArea 0.0839068465 0.167773990 0.052171698 0.1281855914
## OverallQual -0.0083522477 0.537519876 0.301497106 0.2022437617
## OverallCond 0.1478193160 -0.187917809 -0.033796140 0.0307329488
## YearBuilt 0.0195111209 0.437635581 0.250816462 -0.0694496337
## YearRemodAdd 0.0589708784 0.323215912 0.171697392 0.0003271869
## MasVnrArea 0.0378290608 0.354460770 0.260020168 0.1096138077
## BsmtFinSF1 0.0791592516 0.168849004 0.049209772 -0.0942213086
## BsmtFinSF2 0.0229684363 -0.060610594 -0.020889658 -0.0437154419
## BsmtUnfSF -0.0995440528 0.253954093 -0.003077859 0.1937302335
## TotalBsmtSF -0.0041551333 0.392206552 0.039654873 0.0701792954
## X1stFlrSF -0.0280732022 0.459400155 0.007522238 0.1444867442
## X2ndFlrSF -0.0246742601 0.411736762 0.537390333 0.5785577967
## LowQualFinSF -0.0134975818 0.008561864 -0.030175941 0.1514018069
## GrLivArea -0.0404756297 0.642375016 0.417847566 0.5682108253
## BsmtFullBath -0.1361956351 0.019440640 -0.011698472 -0.0990199381
## BsmtHalfBath 1.0000000000 -0.091267748 -0.007769430 0.0197355422
## FullBath -0.0912677477 1.000000000 0.130732330 0.3927814312
## HalfBath -0.0077694301 0.130732330 1.000000000 0.2159327616
## BedroomAbvGr 0.0197355422 0.392781431 0.215932762 1.0000000000
## KitchenAbvGr -0.0304846954 0.174014882 -0.106025118 0.1507613694
## TotRmsAbvGrd -0.0560859661 0.574810859 0.324532042 0.6775038509
## Fireplaces 0.0201455382 0.282806924 0.232194638 0.1555681897
## GarageYrBlt -0.0066238021 0.411006686 0.208177429 -0.0545290540
## GarageCars 0.0004902256 0.470736309 0.238102447 0.0913750111
## GarageArea 0.0054796182 0.418959010 0.180775474 0.0561159618
## WoodDeckSF 0.0354795574 0.231975787 0.083135535 0.0294211851
## OpenPorchSF -0.0251618573 0.220458505 0.184205108 0.1115820634
## EnclosedPorch -0.0408969773 -0.126014065 -0.117835501 0.0521008419
## X3SsnPorch 0.0654300393 0.016033824 0.037521372 -0.0090476342
## ScreenPorch -0.0048850467 0.016956458 0.080437495 0.0819139077
## PoolArea 0.0776798339 -0.007244223 0.025000048 0.0355388285
## MiscVal -0.0111828857 -0.026956401 -0.055055315 -0.0223507184
## MoSold 0.0155478955 0.042139011 0.008944542 0.0576824297
## YrSold -0.0589132695 -0.022510252 -0.048204444 -0.0643252051
## SalePrice -0.0019700863 0.566740338 0.339805234 0.2160076068
## AgeAtSale -0.0217245295 -0.438295175 -0.252525918 0.0669936283
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## Id -0.023401875 0.005085362 0.008473905 0.004330559 -0.0001871438
## MSSubClass 0.441858582 0.199538860 -0.025726795 -0.036827986 -0.0304378654
## LotArea -0.031203353 0.197419943 0.303203671 0.063282117 0.2241061296
## OverallQual -0.136340432 0.455275344 0.445477486 0.502061406 0.6082166888
## OverallCond -0.066524785 -0.074249810 -0.046408969 -0.236380028 -0.1693771680
## YearBuilt -0.210917138 0.080545460 0.242856318 0.780811514 0.5472277342
## YearRemodAdd -0.130313493 0.130752921 0.110437871 0.512296412 0.3703407757
## MasVnrArea -0.036974643 0.309220169 0.268966858 0.355357650 0.4341289806
## BsmtFinSF1 -0.061503305 0.073172683 0.285069181 0.294241975 0.3335754373
## BsmtFinSF2 -0.048287367 -0.051956262 0.058216119 -0.001802506 -0.0034738476
## BsmtUnfSF 0.018940083 0.252758832 0.056152013 0.129055265 0.1860054622
## TotalBsmtSF -0.064266234 0.295782531 0.370190960 0.425599256 0.5197260666
## X1stFlrSF -0.002570116 0.440271716 0.460004164 0.383168193 0.5222703035
## X2ndFlrSF 0.140692483 0.652755703 0.204607992 -0.030264466 0.1473055987
## LowQualFinSF 0.011127107 0.179144977 -0.035077316 -0.089984746 -0.1045119807
## GrLivArea 0.108553193 0.836663589 0.476973811 0.234351994 0.4677274135
## BsmtFullBath -0.049122088 -0.035065599 0.123015140 0.201751509 0.1656080428
## BsmtHalfBath -0.030484695 -0.056085966 0.020145538 -0.006623802 0.0004902256
## FullBath 0.174014882 0.574810859 0.282806924 0.411006686 0.4707363087
## HalfBath -0.106025118 0.324532042 0.232194638 0.208177429 0.2381024469
## BedroomAbvGr 0.150761369 0.677503851 0.155568190 -0.054529054 0.0913750111
## KitchenAbvGr 1.000000000 0.240166904 -0.099630056 -0.166511210 -0.0365934148
## TotRmsAbvGrd 0.240166904 1.000000000 0.351792126 0.134455453 0.3550874341
## Fireplaces -0.099630056 0.351792126 1.000000000 0.161075101 0.3265758686
## GarageYrBlt -0.166511210 0.134455453 0.161075101 1.000000000 0.6653460368
## GarageCars -0.036593415 0.355087434 0.326575869 0.665346037 1.0000000000
## GarageArea -0.048666291 0.325033861 0.282556271 0.667554488 0.8962457653
## WoodDeckSF -0.084502194 0.154060230 0.205871356 0.306703619 0.2952581473
## OpenPorchSF -0.032090697 0.229823575 0.158030643 0.141370803 0.1617362530
## EnclosedPorch 0.086887819 0.002858394 -0.105589401 -0.324049169 -0.1605543387
## X3SsnPorch -0.027910613 0.002951087 -0.027536537 0.050147073 0.0349356013
## ScreenPorch -0.048987716 0.060826861 0.208893585 -0.009249931 0.0436412649
## PoolArea -0.011203489 -0.009867632 0.027795577 -0.032532082 0.0222191055
## MiscVal 0.015862258 -0.032963307 -0.023866778 -0.016535865 -0.0540296818
## MoSold 0.020164490 0.033196699 0.008965144 -0.007270497 0.0181558259
## YrSold -0.009185670 -0.079425737 -0.027513015 0.002061593 -0.0482939056
## SalePrice -0.118016676 0.541576554 0.520558284 0.512555981 0.6524512682
## AgeAtSale 0.210479451 -0.083506066 -0.243788863 -0.780395923 -0.5488123130
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Id 0.010360966 -0.04577445 -0.012088268 0.006191153 -0.0447543072
## MSSubClass -0.074054376 -0.03664879 -0.006616687 0.047757220 -0.0641133767
## LotArea 0.228741545 0.20405422 0.110835032 -0.036208670 0.0415910854
## OverallQual 0.556039810 0.26967422 0.261312933 -0.143470852 -0.0011505160
## OverallCond -0.135474973 -0.01219271 -0.024727444 0.031693529 0.0659151977
## YearBuilt 0.489897207 0.28025168 0.148461955 -0.422539191 0.0525336628
## YearRemodAdd 0.339271194 0.24696657 0.147292860 -0.241521583 0.0692085136
## MasVnrArea 0.441508799 0.23366712 0.099611110 -0.158127597 0.0016926525
## BsmtFinSF1 0.375789219 0.23988749 0.145155194 -0.143636712 0.0210997190
## BsmtFinSF2 0.007852866 0.10176366 0.013583386 -0.032742899 -0.0271517770
## BsmtUnfSF 0.145591955 -0.02076452 0.096703854 0.017787272 0.0002356145
## TotalBsmtSF 0.529212774 0.26739023 0.247064638 -0.143990779 0.0113729885
## X1stFlrSF 0.543036385 0.30611153 0.221026630 -0.131021455 0.0527851393
## X2ndFlrSF 0.106770028 0.04595860 0.179353015 0.095570813 -0.0312375867
## LowQualFinSF -0.068292563 -0.02654542 0.017376421 0.050490384 -0.0075632547
## GrLivArea 0.455359450 0.24713794 0.296173980 -0.011689540 0.0119111671
## BsmtFullBath 0.202026100 0.16851863 0.066650395 -0.110295667 0.0035722852
## BsmtHalfBath 0.005479618 0.03547956 -0.025161857 -0.040896977 0.0654300393
## FullBath 0.418959010 0.23197579 0.220458505 -0.126014065 0.0160338236
## HalfBath 0.180775474 0.08313553 0.184205108 -0.117835501 0.0375213719
## BedroomAbvGr 0.056115962 0.02942119 0.111582063 0.052100842 -0.0090476342
## KitchenAbvGr -0.048666291 -0.08450219 -0.032090697 0.086887819 -0.0279106127
## TotRmsAbvGrd 0.325033861 0.15406023 0.229823575 0.002858394 0.0029510874
## Fireplaces 0.282556271 0.20587136 0.158030643 -0.105589401 -0.0275365373
## GarageYrBlt 0.667554488 0.30670362 0.141370803 -0.324049169 0.0501470733
## GarageCars 0.896245765 0.29525815 0.161736253 -0.160554339 0.0349356013
## GarageArea 1.000000000 0.29420256 0.192798635 -0.134246874 0.0446375168
## WoodDeckSF 0.294202564 1.00000000 0.030751748 -0.163649643 0.0112990601
## OpenPorchSF 0.192798635 0.03075175 1.000000000 -0.144953640 -0.0029524114
## EnclosedPorch -0.134246874 -0.16364964 -0.144953640 1.000000000 -0.0610071747
## X3SsnPorch 0.044637517 0.01129906 -0.002952411 -0.061007175 1.0000000000
## ScreenPorch 0.047815348 -0.09022230 0.120883120 -0.106555424 -0.0381667912
## PoolArea 0.039392620 0.02535036 -0.029246218 0.155303012 -0.0070817515
## MiscVal -0.036670177 -0.02801053 -0.020976509 -0.024786225 0.0062928731
## MoSold 0.008793856 0.03438502 0.105934597 -0.086582786 0.0543260922
## YrSold -0.034170745 0.04260934 -0.042470684 -0.033169470 0.0061757417
## SalePrice 0.633305006 0.34838218 0.272704102 -0.171166656 0.0220926009
## AgeAtSale -0.490973952 -0.27852345 -0.149999455 0.421105416 -0.0522780211
## ScreenPorch PoolArea MiscVal MoSold YrSold
## Id 0.010142068 -0.023782507 -0.033044711 0.0209367886 0.041381444
## MSSubClass 0.026662815 -0.013957198 -0.017313856 -0.0222975847 -0.034270415
## LotArea 0.028175899 0.037804421 0.024315317 -0.0020030558 -0.044487296
## OverallQual 0.075581831 0.032688407 -0.018628474 0.0561598755 -0.046477145
## OverallCond 0.087323870 -0.032863505 0.071427779 0.0327817131 0.075629822
## YearBuilt -0.023828456 -0.013680301 -0.016779018 -0.0001492084 0.007389979
## YearRemodAdd 0.012062364 0.020006165 0.004908762 0.0222041172 0.093102715
## MasVnrArea 0.038251562 -0.008463973 -0.025889577 -0.0153313469 -0.016194832
## BsmtFinSF1 0.035624870 0.051913386 0.004564695 -0.0217237560 0.045788190
## BsmtFinSF2 0.065055835 0.078378261 -0.017553525 -0.0591126904 0.044933717
## BsmtUnfSF 0.017723013 -0.067326782 -0.028751574 0.0337958583 -0.083825086
## TotalBsmtSF 0.079034840 0.020071808 -0.029523891 -0.0133305673 -0.015021335
## X1stFlrSF 0.093181436 0.023841599 -0.037508446 0.0012206862 -0.029749520
## X2ndFlrSF 0.068940362 0.025140118 -0.024132396 0.0578180614 -0.075174420
## LowQualFinSF 0.038248170 0.123236596 -0.008031946 -0.0294487014 -0.040035047
## GrLivArea 0.123502130 0.051381868 -0.046003696 0.0420181834 -0.084039695
## BsmtFullBath -0.017347176 0.063381146 -0.033475603 -0.0649779728 0.112956054
## BsmtHalfBath -0.004885047 0.077679834 -0.011182886 0.0155478955 -0.058913270
## FullBath 0.016956458 -0.007244223 -0.026956401 0.0421390114 -0.022510252
## HalfBath 0.080437495 0.025000048 -0.055055315 0.0089445418 -0.048204444
## BedroomAbvGr 0.081913908 0.035538829 -0.022350718 0.0576824297 -0.064325205
## KitchenAbvGr -0.048987716 -0.011203489 0.015862258 0.0201644900 -0.009185670
## TotRmsAbvGrd 0.060826861 -0.009867632 -0.032963307 0.0331966989 -0.079425737
## Fireplaces 0.208893585 0.027795577 -0.023866778 0.0089651440 -0.027513015
## GarageYrBlt -0.009249931 -0.032532082 -0.016535865 -0.0072704967 0.002061593
## GarageCars 0.043641265 0.022219105 -0.054029682 0.0181558259 -0.048293906
## GarageArea 0.047815348 0.039392620 -0.036670177 0.0087938565 -0.034170745
## WoodDeckSF -0.090222304 0.025350365 -0.028010525 0.0343850193 0.042609338
## OpenPorchSF 0.120883120 -0.029246218 -0.020976509 0.1059345966 -0.042470684
## EnclosedPorch -0.106555424 0.155303012 -0.024786225 -0.0865827864 -0.033169470
## X3SsnPorch -0.038166791 -0.007081751 0.006292873 0.0543260922 0.006175742
## ScreenPorch 1.000000000 -0.015320381 0.013011765 0.0030214365 -0.001870863
## PoolArea -0.015320381 1.000000000 -0.004921803 -0.0878305931 -0.075501117
## MiscVal 0.013011765 -0.004921803 1.000000000 -0.0271438064 0.014712001
## MoSold 0.003021436 -0.087830593 -0.027143806 1.0000000000 -0.118265069
## YrSold -0.001870863 -0.075501117 0.014712001 -0.1182650692 1.000000000
## SalePrice 0.123848829 0.015537571 -0.025999230 -0.0184071939 -0.043878232
## AgeAtSale 0.023747588 0.010826938 0.017326606 -0.0043110962 0.030327145
## SalePrice AgeAtSale
## Id -0.035167337 0.041500216
## MSSubClass -0.044642350 0.072510112
## LotArea 0.311834688 -0.076975968
## OverallQual 0.791809766 -0.549196907
## OverallCond -0.076108324 0.322424570
## YearBuilt 0.561306440 -0.999288616
## YearRemodAdd 0.445747691 -0.554239145
## MasVnrArea 0.592684037 -0.417657819
## BsmtFinSF1 0.500689216 -0.390383608
## BsmtFinSF2 0.012493079 -0.008532442
## BsmtUnfSF 0.180504169 -0.107972497
## TotalBsmtSF 0.693068292 -0.508715453
## X1stFlrSF 0.685510964 -0.412249685
## X2ndFlrSF 0.303429101 0.090450968
## LowQualFinSF -0.028278444 0.173354759
## GrLivArea 0.712605540 -0.197868807
## BsmtFullBath 0.259817212 -0.274482444
## BsmtHalfBath -0.001970086 -0.021724530
## FullBath 0.566740338 -0.438295175
## HalfBath 0.339805234 -0.252525918
## BedroomAbvGr 0.216007607 0.066993628
## KitchenAbvGr -0.118016676 0.210479451
## TotRmsAbvGrd 0.541576554 -0.083506066
## Fireplaces 0.520558284 -0.243788863
## GarageYrBlt 0.512555981 -0.780395923
## GarageCars 0.652451268 -0.548812313
## GarageArea 0.633305006 -0.490973952
## WoodDeckSF 0.348382180 -0.278523453
## OpenPorchSF 0.272704102 -0.149999455
## EnclosedPorch -0.171166656 0.421105416
## X3SsnPorch 0.022092601 -0.052278021
## ScreenPorch 0.123848829 0.023747588
## PoolArea 0.015537571 0.010826938
## MiscVal -0.025999230 0.017326606
## MoSold -0.018407194 -0.004311096
## YrSold -0.043878232 0.030327145
## SalePrice 1.000000000 -0.562718394
## AgeAtSale -0.562718394 1.000000000
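The SalePrice column of the matrix above is easier to read once the correlations are ranked by absolute value. The sketch below hard-codes a handful of values copied from the printed output; the vector name `sale_cor` is illustrative and not part of the original analysis.

```r
# Correlations with SalePrice, copied from the printed matrix above
# (a subset only; `sale_cor` is an illustrative name).
sale_cor <- c(
  OverallQual = 0.791809766,
  GrLivArea   = 0.712605540,
  TotalBsmtSF = 0.693068292,
  X1stFlrSF   = 0.685510964,
  GarageCars  = 0.652451268,
  GarageArea  = 0.633305006,
  AgeAtSale   = -0.562718394,
  LotArea     = 0.311834688,
  PoolArea    = 0.015537571
)

# Rank by strength of association, ignoring the sign
ranked <- sale_cor[order(abs(sale_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

The same output shows YearBuilt and AgeAtSale correlating at about -0.999, so the two variables carry nearly the same information and probably should not both enter a linear model.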
ggplot(ames_housing, aes(x = Renovated, y = SalePrice, fill = Renovated)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red") +
labs(title = "Impact of Renovations on Sale Prices",
x = "Renovation Status",
y = "Sale Price") +
scale_fill_manual(values = c("Not Renovated" = "#1f77b4", "Renovated" = "#ff7f0e")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
How do energy efficiency and utilities impact the sale price of a house?
The analysis shows a clear association between heating quality and sale price. Homes with excellent heating have the highest prices (mean about $214,900, median $194,700), while homes with fair or poor heating have the lowest (mean about $123,200, median $122,750). Almost every home in the data has all public utilities, so the comparison across utility types rests on very few observations. Overall, the price distributions point to the value of a quality heating system and access to basic utility services when maximizing a home's sale price.
Sale Price of House by Utility and Heating Quality
The chart demonstrates that homes with all public utilities generally command higher prices across all heating quality categories, with superior heating quality associated with the highest prices. Conversely, properties with fair or poor heating quality, regardless of utilities, tend to have lower sale prices, emphasizing the negative impact of suboptimal heating on property values. This underscores the importance of essential services and effective heating in enhancing residential property marketability and value.
ggplot(ames_housing, aes(x = Utilities, y = SalePrice, fill = HeatingQC)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
facet_wrap(~ HeatingQC, scales = "free_y") +
labs(title = "Sale Prices by Utility Type and Heating Quality",
x = "Utilities", y = "Sale Price") +
scale_fill_viridis_d(option = "inferno") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Density of Sale Price by Utilities and Heating Quality
This series of density plots shows sale price distributions by utilities (all public utilities vs. no sewer/water) and heating quality (excellent, good, average, fair, poor). Homes with all public utilities show the densest concentrations at higher sale prices when heating quality is excellent or good, indicating higher median prices in those groups. No density is visible for homes without sewer and water: the recoding step before the plot folds "NoSeWa" into "AllPub" and "Po" into "Fa", so only one utilities group and four heating categories remain. Overall, the plots highlight the association between heating quality and home values.
ames_housing$Utilities <- ifelse(ames_housing$Utilities == "NoSeWa", "AllPub", ames_housing$Utilities)
ames_housing$HeatingQC <- ifelse(ames_housing$HeatingQC == "Po", "Fa", ames_housing$HeatingQC)
table(ames_housing$Utilities)
##
## AllPub
## 1460
table(ames_housing$HeatingQC)
##
## Ex Fa Gd TA
## 741 50 241 428
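The recoding above works because both columns are stored as character vectors. If a column were a factor instead, one base-R way to merge categories is to rename a level to an existing level name. The sketch below uses a made-up toy vector rather than the Ames data.

```r
# Toy vector standing in for ames_housing$HeatingQC (values are made up)
heating <- factor(c("Ex", "Po", "Gd", "TA", "Fa", "Po", "Ex"))

# Renaming a level to an existing level name merges the two levels
levels(heating)[levels(heating) == "Po"] <- "Fa"

print(table(heating))

# After the merge, no observation is labelled "Po" any more
n_po <- sum(heating == "Po")
print(n_po)
```

One consequence is worth keeping in mind: once "Po" has been merged into "Fa", any later filter on "Po" returns zero rows.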
ggplot(ames_housing, aes(x = SalePrice, fill = Utilities)) +
geom_density(alpha = 0.6) +
facet_wrap(~ HeatingQC, scales = "free") +
labs(title = "Density of Sale Prices by Utilities and Heating Quality",
x = "Sale Price", y = "Density") +
scale_fill_viridis_d(option = "plasma") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Interaction of Utilities and Heating Quality on Sale Price of House
This plot shows how utilities and heating quality influence sale prices. Homes with all public utilities (AllPub) exhibit a wide range of sale prices, indicating varied heating qualities from excellent to poor. Conversely, homes without sewer or water (NoSeWa) are rare and generally have lower sale prices, highlighting the importance of standard utilities in maintaining property value.
ggplot(ames_housing, aes(x = Utilities, y = SalePrice, color = HeatingQC)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, fill = "lightblue", alpha = 0.3, aes(group = HeatingQC)) +
labs(title = "Interaction of Utilities and Heating Quality on Sale Prices",
x = "Utilities", y = "Sale Price") +
scale_color_brewer(palette = "Dark2") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
## `geom_smooth()` using formula = 'y ~ x'
stats_summary <- ames_housing %>%
group_by(Utilities, HeatingQC) %>%
summarise(
Count = n(),
Mean = mean(SalePrice, na.rm = TRUE),
Median = median(SalePrice, na.rm = TRUE),
SD = sd(SalePrice, na.rm = TRUE)
)
## `summarise()` has grouped output by 'Utilities'. You can override using the
## `.groups` argument.
print(stats_summary)
## # A tibble: 4 × 6
## # Groups: Utilities [1]
## Utilities HeatingQC Count Mean Median SD
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 AllPub Ex 741 214914. 194700 87470.
## 2 AllPub Fa 50 123181. 122750 50064.
## 3 AllPub Gd 241 156859. 152000 52924.
## 4 AllPub TA 428 142363. 135000 47226.
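The group means in the summary above can be expressed as percentage differences relative to a baseline, which makes the heating-quality gap easier to compare. The sketch below copies the printed means and uses "TA" (typical/average) as the baseline; that choice is an assumption made for illustration. These are raw group differences, not adjusted for size, age, or neighborhood.

```r
# Group means copied from the printed summary above
mean_price <- c(Ex = 214914, Gd = 156859, TA = 142363, Fa = 123181)

# Percentage difference relative to typical/average heating ("TA")
pct_vs_ta <- (mean_price / mean_price[["TA"]] - 1) * 100
print(round(pct_vs_ta, 1))
```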
correlations <- ames_housing %>%
group_by(Utilities, HeatingQC) %>%
summarise(Correlation = cor(SalePrice, LotArea, use = "complete.obs"), .groups = 'drop')
print(correlations)
## # A tibble: 4 × 3
## Utilities HeatingQC Correlation
## <chr> <chr> <dbl>
## 1 AllPub Ex 0.256
## 2 AllPub Fa 0.566
## 3 AllPub Gd 0.334
## 4 AllPub TA 0.453
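The group sizes behind these correlations differ widely (50 homes in the "Fa" group versus 741 in "Ex"), so the estimates are not equally precise. A minimal sketch of an approximate confidence interval using the Fisher z-transformation is shown below. The helper `cor_ci` is a hypothetical name, and the calculation assumes roughly bivariate normal data and that each group count equals the number of complete pairs.

```r
# Approximate confidence interval for a Pearson correlation
# via the Fisher z-transformation.
cor_ci <- function(r, n, level = 0.95) {
  z  <- atanh(r)                      # Fisher transform of r
  se <- 1 / sqrt(n - 3)               # standard error on the z scale
  q  <- qnorm(1 - (1 - level) / 2)    # normal quantile for the chosen level
  tanh(c(lower = z - q * se, upper = z + q * se))
}

# r and n taken from the tables above (HeatingQC groups)
ci_fa <- cor_ci(0.566, 50)
ci_ex <- cor_ci(0.256, 741)
print(round(rbind(Fa = ci_fa, Ex = ci_ex), 3))
```

The interval for the small "Fa" group is much wider than the one for "Ex", so its larger point estimate should be read with caution.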
Comparison of Sale Prices: High vs Low Heating Quality
This density plot contrasts sale price distributions for homes with high (excellent) and low (poor) heating quality. The green curve shows the clustered pricing of homes with excellent heating. The red curve is absent because the earlier recoding step replaced "Po" with "Fa", so the filter HeatingQC == "Po" returns no rows; filtering on "Fa" instead would restore the low-quality group. Together with the group summaries above, the visualization points to a strong association between heating quality and home prices, consistent with buyer preference for comfort and efficiency.
high_efficiency <- ames_housing %>% filter(HeatingQC == "Ex")
low_efficiency <- ames_housing %>% filter(HeatingQC == "Po")
ggplot() +
geom_density(data = high_efficiency, aes(x = SalePrice, fill = "High"), alpha = 0.5) +
geom_density(data = low_efficiency, aes(x = SalePrice, fill = "Low"), alpha = 0.5) +
labs(title = "Comparison of Sale Prices: High vs Low Heating Quality",
x = "Sale Price", y = "Density",
fill = "Heating Quality") +
scale_fill_manual(values = c("High" = "green", "Low" = "red")) +
theme_minimal() +
theme(
legend.position = "top",
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
What is the impact of landscape and outdoor features on the sale price of a house?
The analysis suggests that landscape and outdoor features, particularly pools, are associated with higher sale prices. Homes with pools show a higher median sale price, consistent with the luxury and aesthetic appeal a pool adds. Only 7 of the 1,460 homes have a pool, however, so these comparisons rest on a very small group. The variability in prices among pool-equipped homes suggests that factors such as size, style, and maintenance also matter. The price distribution for homes with pools is broader and more right-skewed than for homes without. Lot area is positively correlated with sale price, which points to the added value of larger lots. Overall, the data indicate that landscape features, especially pools, play a role in property valuation.
Median Sale Price of House by Pool
This bar chart compares median sale prices of homes with and without pools, showing higher prices for pool-equipped homes, emphasizing their significant role in enhancing property value. The broader range of sale prices among homes with pools suggests factors like size, style, and maintenance influence prices. Overall, the data highlights pools as desirable features significantly impacting home valuation in the real estate market.
str(ameshous_test_data)
## 'data.frame': 1459 obs. of 80 variables:
## $ Id : int 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
## $ MSSubClass : int 20 20 60 60 120 60 20 60 20 20 ...
## $ MSZoning : chr "RH" "RL" "RL" "RL" ...
## $ LotFrontage : int 80 81 74 78 43 75 NA 63 85 70 ...
## $ LotArea : int 11622 14267 13830 9978 5005 10000 7980 8402 10176 8400 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "IR1" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "Corner" "Inside" "Inside" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "NAmes" "NAmes" "Gilbert" "Gilbert" ...
## $ Condition1 : chr "Feedr" "Norm" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "1Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 5 6 5 6 8 6 6 6 7 4 ...
## $ OverallCond : int 6 6 5 6 5 5 7 5 5 5 ...
## $ YearBuilt : int 1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
## $ YearRemodAdd : int 1961 1958 1998 1998 1992 1994 2007 1998 1990 1970 ...
## $ RoofStyle : chr "Gable" "Hip" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
## $ Exterior2nd : chr "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
## $ MasVnrType : chr "None" "BrkFace" "None" "BrkFace" ...
## $ MasVnrArea : int 0 108 0 20 0 0 0 0 0 0 ...
## $ ExterQual : chr "TA" "TA" "TA" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "CBlock" "CBlock" "PConc" "PConc" ...
## $ BsmtQual : chr "TA" "TA" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "TA" ...
## $ BsmtExposure : chr "No" "No" "No" "No" ...
## $ BsmtFinType1 : chr "Rec" "ALQ" "GLQ" "GLQ" ...
## $ BsmtFinSF1 : int 468 923 791 602 263 0 935 0 637 804 ...
## $ BsmtFinType2 : chr "LwQ" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 144 0 0 0 0 0 0 0 0 78 ...
## $ BsmtUnfSF : int 270 406 137 324 1017 763 233 789 663 0 ...
## $ TotalBsmtSF : int 882 1329 928 926 1280 763 1168 789 1300 882 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "TA" "TA" "Gd" "Ex" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 896 1329 928 926 1280 763 1187 789 1341 882 ...
## $ X2ndFlrSF : int 0 0 701 678 0 892 0 676 0 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 896 1329 1629 1604 1280 1655 1187 1465 1341 882 ...
## $ BsmtFullBath : int 0 0 0 0 0 0 1 0 1 1 ...
## $ BsmtHalfBath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 1 1 2 2 2 2 2 2 1 1 ...
## $ HalfBath : int 0 1 1 1 0 1 0 1 1 0 ...
## $ BedroomAbvGr : int 2 3 3 3 2 3 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ KitchenQual : chr "TA" "Gd" "TA" "Gd" ...
## $ TotRmsAbvGrd : int 5 6 6 7 5 7 6 7 5 4 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 0 1 1 0 1 0 1 1 0 ...
## $ FireplaceQu : chr NA NA "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ GarageYrBlt : int 1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
## $ GarageFinish : chr "Unf" "Unf" "Fin" "Fin" ...
## $ GarageCars : int 1 1 2 2 2 2 2 2 2 2 ...
## $ GarageArea : int 730 312 482 470 506 440 420 393 506 525 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 140 393 212 360 0 157 483 0 192 240 ...
## $ OpenPorchSF : int 0 36 34 36 82 84 21 75 0 0 ...
## $ EnclosedPorch: int 0 0 0 0 0 0 0 0 0 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ScreenPorch : int 120 0 0 0 144 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr "MnPrv" NA "MnPrv" NA ...
## $ MiscFeature : chr NA "Gar2" NA NA ...
## $ MiscVal : int 0 12500 0 0 0 0 500 0 0 0 ...
## $ MoSold : int 6 6 3 6 1 4 3 5 2 4 ...
## $ YrSold : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Normal" ...
table(ames_housing$PoolQC)
##
## Ex Fa Gd None
## 2 2 3 1453
ames_housing$HasPool <- factor(ifelse(ames_housing$PoolQC %in% c("Ex", "Gd", "TA", "Fa"), "Yes", "No"),
levels = c("No", "Yes"),
labels = c("No Pool", "Has Pool"))
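Before comparing the two groups, it helps to see how small the pool group is. The sketch below reuses the counts printed by `table(ames_housing$PoolQC)` above; the variable names are illustrative.

```r
# Counts copied from table(ames_housing$PoolQC) above
pool_qc <- c(Ex = 2, Fa = 2, Gd = 3, None = 1453)

n_pool  <- sum(pool_qc[names(pool_qc) != "None"])  # homes with any pool quality rating
n_total <- sum(pool_qc)
share   <- n_pool / n_total

cat(sprintf("%d of %d homes have a pool (%.2f%%)\n", n_pool, n_total, 100 * share))
```

With so few pool-equipped homes, medians and spreads for that group can shift noticeably with a single sale.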
# Plot
ggplot(ames_housing, aes(x = factor(HasPool), y = SalePrice, fill = factor(HasPool))) +
stat_summary(fun = median, geom = "bar", position = position_dodge(width = 0.8), width = 0.6) +
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(width = 0.8), width = 0.2) +
labs(title = "Median Sale Prices by Pool Presence",
x = "Has Pool", y = "Median Sale Price") +
scale_fill_manual(values = c("No Pool" = "lightblue", "Has Pool" = "orange")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Density of Sale Price by Pool Availability
This density plot compares sale price distributions based on pool presence. Homes without a pool exhibit a tighter distribution with lower median prices, while those with a pool show a broader distribution and significant right skew, indicating higher median prices and a tail of high-value transactions. The data suggests pools add a luxury element, significantly boosting property values, particularly at the upper end of the market.
ggplot(ames_housing, aes(x = SalePrice, fill = factor(HasPool))) +
geom_density(alpha = 0.5) +
geom_vline(aes(xintercept = median(SalePrice)), color = "black", linetype = "dashed", linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
geom_vline(aes(xintercept = quantile(SalePrice, 0.25)), color = "red", linetype = "dashed", linewidth = 0.8) +
geom_vline(aes(xintercept = quantile(SalePrice, 0.75)), color = "blue", linetype = "dashed", linewidth = 0.8) +
facet_wrap(~ HasPool) +
labs(title = "Density of Sale Prices by Pool Presence",
x = "Sale Price", y = "Density", fill = "Has Pool") +
scale_fill_manual(values = c("No Pool" = "lightblue", "Has Pool" = "orange")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
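The vertical lines in the plot above come from expressions such as `median(SalePrice)` inside `aes()`. As far as can be told from how ggplot2 evaluates aesthetics, these are computed on the whole data set, so both facets would show the same overall quartiles. A per-group alternative is to compute the quartiles for each group first. The sketch below does this in base R on made-up prices; `toy` and `group_q` are illustrative names.

```r
# Made-up prices for illustration only (not from the Ames data)
toy <- data.frame(
  HasPool   = rep(c("No Pool", "Has Pool"), times = c(6, 4)),
  SalePrice = c(120000, 135000, 150000, 165000, 180000, 210000,
                160000, 190000, 250000, 400000)
)

# One row per group with the quartiles of SalePrice
group_q <- do.call(rbind, lapply(split(toy, toy$HasPool), function(d) {
  data.frame(
    HasPool = d$HasPool[1],
    q25     = unname(quantile(d$SalePrice, 0.25)),
    median  = median(d$SalePrice),
    q75     = unname(quantile(d$SalePrice, 0.75))
  )
}))
print(group_q)

# With ggplot2 loaded, such a table could be supplied per facet, e.g.
# geom_vline(data = group_q, aes(xintercept = median), linetype = "dashed")
```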
Sale Price of House by availability of Pool
This plot compares sale price distributions for homes with and without pools using boxplots overlaid with the individual data points. Homes without pools (left) cluster around a lower median with a tight interquartile range and many points extending into higher price ranges. Homes with pools (right) show a slightly higher median and a broader interquartile range, based on far fewer sales. The red diamonds mark the group medians and the blue error bars span the first to third quartiles, making the higher median for pool-equipped homes easy to see.
ggplot(ames_housing, aes(x = factor(HasPool), y = SalePrice, fill = factor(HasPool))) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5, color = "black") +
stat_summary(fun = median, geom = "point", shape = 18, size = 4, color = "red") +
stat_summary(fun.data = function(x) {
# Return one row per group: the median with the first and third quartiles
quantiles <- quantile(x, c(0.25, 0.75), names = FALSE)
data.frame(y = median(x), ymin = quantiles[1], ymax = quantiles[2])
}, geom = "errorbar", width = 0.2, color = "blue") +
labs(title = "Sale Prices by Pool Presence",
x = "Has Pool", y = "Sale Price") +
scale_fill_manual(values = c("No Pool" = "lightblue", "Has Pool" = "orange")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Sale Price vs Lot Area of House by Presence of Pool
This scatter plot shows the relationship between lot area and sale price for all homes, with points colored by pool presence. The blue trend line, fitted to all homes together, has a positive slope: as lot area increases, sale price tends to increase as well. Most homes cluster at smaller lot sizes with a wide range of prices, so lot size alone explains only part of the variation. Outliers at large lot sizes suggest that other factors, such as location and amenities, also influence sale prices.
ggplot(ames_housing, aes(x = LotArea, y = SalePrice, color = factor(HasPool))) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, aes(group = 1), color = "blue") +
labs(title = "Sale Price vs. Lot Area by Pool Presence",
x = "Lot Area (sq feet)", y = "Sale Price",
color = "Has Pool") + # Add legend title
scale_color_manual(values = c("No Pool" = "red", "Has Pool" = "green")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "bottom"
)
## `geom_smooth()` using formula = 'y ~ x'
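The strength of the linear trend in this plot can be quantified from the correlation matrix printed earlier in the report. For a simple linear regression with one predictor, the coefficient of determination equals the squared Pearson correlation. The sketch below applies that identity to the LotArea and SalePrice entry; the variable names are illustrative.

```r
# Pearson correlation between LotArea and SalePrice,
# copied from the correlation matrix printed earlier in the report
r_lot_price <- 0.311834688

# For a simple linear regression with one predictor, R^2 = r^2
r_squared <- r_lot_price^2
cat(sprintf("Share of SalePrice variance explained by LotArea alone: %.1f%%\n",
            100 * r_squared))
```

A value of roughly one tenth suggests that lot area on its own is a weak linear predictor of sale price.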
How do neighborhood amenities affect the sale price of a house?
Comprehensive data analysis reveals the crucial role of neighborhood amenities in determining house sale prices. Variation in median sale prices across the top 10 neighborhoods, with areas like NridgHt, NoRidge, and StoneBr consistently commanding higher prices, underscores their desirability and likely higher affluence. Sale price distributions within neighborhoods highlight the complexity of real estate pricing, influenced by factors like location and property features. These insights emphasize the importance of considering neighborhood amenities when assessing sale prices, as they significantly impact market dynamics and buyer perceptions.
Median Sale Price of Houses in the Top 10 Neighborhoods
This boxplot shows sale prices in the 10 neighborhoods with the highest median sale price, revealing variation in housing market dynamics. Neighborhoods like NridgHt, NoRidge, and StoneBr show the highest medians, indicating greater affluence or desirability. Box lengths represent the interquartile range within each neighborhood; longer boxes, as for Veenker, suggest a more diverse market. Individual points far from the box are sales that deviate strongly from typical prices, possibly because of unique features or conditions. Overall, the plot underscores the substantial association between neighborhood and home prices.
my_colors <- RColorBrewer::brewer.pal(10, "Set3")
if (length(my_colors) < 10) {
my_colors <- colorRampPalette(my_colors)(10)
}
top_neighborhoods <- ames_housing %>%
group_by(Neighborhood) %>%
summarize(MedianSalePrice = median(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
top_n(10, MedianSalePrice) %>%
arrange(desc(MedianSalePrice)) %>%
pull(Neighborhood)
ames_housing_top <- ames_housing %>%
filter(Neighborhood %in% top_neighborhoods)
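The selection step above (median per neighborhood, keep the highest, then filter the sales) can also be written in base R. The sketch below runs on a small made-up data frame; the neighborhood codes are real Ames codes but the prices are invented, and `toy`, `med`, and `top_names` are illustrative names.

```r
# Made-up sales for illustration only (prices are not from the Ames data)
toy <- data.frame(
  Neighborhood = c("NridgHt", "NridgHt", "NAmes", "NAmes",
                   "StoneBr", "StoneBr", "OldTown"),
  SalePrice    = c(310000, 330000, 140000, 150000, 280000, 420000, 110000)
)

# Median price per neighborhood, sorted from highest to lowest
med <- aggregate(SalePrice ~ Neighborhood, data = toy, FUN = median)
med <- med[order(med$SalePrice, decreasing = TRUE), ]

# Keep the n most expensive neighborhoods (n = 2 here; the report uses 10)
top_names <- as.character(head(med$Neighborhood, 2))
toy_top   <- toy[toy$Neighborhood %in% top_names, ]
print(top_names)
```

One difference to be aware of: `head()` returns exactly n rows, whereas dplyr's `top_n()` keeps ties and may therefore return more than n groups.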
ggplot(ames_housing_top, aes(x = reorder(Neighborhood, SalePrice, FUN = median), y = SalePrice, fill = Neighborhood)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5, color = "black") +
stat_summary(fun = median, geom = "point", shape = 18, size = 4, color = "red") +
stat_summary(fun.data = function(x) {
# Return one row per group: the median with the first and third quartiles
quantiles <- quantile(x, c(0.25, 0.75), names = FALSE)
data.frame(y = median(x), ymin = quantiles[1], ymax = quantiles[2])
}, geom = "errorbar", width = 0.2, color = "blue") +
scale_fill_manual(values = my_colors) +
labs(title = "Median Sale Prices Across Top 10 Neighborhoods",
x = "Neighborhood", y = "Median Sale Price") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Density of Sale Prices in the Top 10 Neighborhoods
This density plot shows how sale prices are distributed in the ten neighborhoods, each drawn in a different color. Neighborhoods such as StoneBr and NridgHt peak at higher prices, which suggests greater affluence, whereas Blmngtn and CollgCr peak at lower prices. The red dashed line marks the median sale price across these neighborhoods, and the blue dotted curve is a normal density with their overall mean and standard deviation. The broader curves for NridgHt and StoneBr point to a wider mix of houses and price levels there.
ggplot(ames_housing_top, aes(x = SalePrice, fill = Neighborhood)) +
geom_density(alpha = 0.6, color = "black") +
geom_vline(aes(xintercept = median(SalePrice)), color = "red", linetype = "dashed", linewidth = 1) +
stat_function(
fun = dnorm,
args = list(mean = mean(ames_housing_top$SalePrice), sd = sd(ames_housing_top$SalePrice)),
aes(x = SalePrice), # explicitly define x
inherit.aes = FALSE, # prevent it from using 'fill = Neighborhood'
color = "blue",
linetype = "dotted"
) +
scale_fill_manual(values = my_colors) +
labs(title = "Density of Sale Prices Across Top 10 Neighborhoods",
x = "Sale Price", y = "Density") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Sale Prices Across Top 10 Neighborhoods Over Time
This scatter plot shows house prices in the top 10 neighborhoods for sales between 2006 and 2010. Colors represent the year of sale, so pricing patterns can be compared across years. NridgHt, NoRidge, and StoneBr consistently have the most expensive properties, which reflects how much buyers are willing to pay for them, mainly because of their desirability and affluence. The spread from high to low prices within each neighborhood reflects differences in property value, possibly due to lot size or home design (including living area). These patterns point to a complex picture that the data only partly explain.
ggplot(ames_housing_top, aes(x = reorder(Neighborhood, SalePrice, FUN = median), y = SalePrice, color = as.factor(YrSold))) +
geom_jitter(alpha = 0.6, width = 0.3) +
geom_boxplot(alpha = 0, outlier.shape = NA, width = 0.2) +
labs(title = "Sale Prices Across Top 10 Neighborhoods Over Time",
x = "Neighborhood", y = "Sale Price",
color = "Year Sold") +
scale_color_discrete(name = "Year Sold") +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
axis.text.y = element_text(size = 12)
)
Time Series of Median Sale Price in the Top 10 Neighborhoods
This line chart tracks the median sale price in each of the top 10 neighborhoods from 2006 to 2010, with one colored line per neighborhood. It shows temporal patterns and possible economic impacts on prices over the five years. Certain neighborhoods consistently command higher medians, indicating desirability or affluence, while the year-to-year movement differs between neighborhoods. The shifts may be linked to broader economic factors or local developments, which helps in understanding what influences home prices over time.
neighborhood_yearly_top <- ames_housing_top %>%
group_by(Neighborhood, YrSold) %>%
summarize(MedianSalePrice = median(SalePrice, na.rm = TRUE), .groups = 'drop')
ggplot(neighborhood_yearly_top, aes(x = factor(YrSold), y = MedianSalePrice, group = Neighborhood, color = Neighborhood)) +
geom_line() +
scale_color_brewer(type = "qual", palette = "Paired") +
labs(title = "Time Series of Median Sale Prices by Top 10 Neighborhoods",
x = "Year Sold", y = "Median Sale Price",
color = "Neighborhood") +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
axis.text.y = element_text(size = 12)
)
2D Density Map of Sale Prices and Lot Area by Top 10 Neighborhoods
This 2D density map showcases sale prices relative to lot area across the top 10 neighborhoods. Denser colors indicate higher sales concentration at specific price points and lot sizes. The plot highlights two main concentrations: one at lower prices and smaller lots, and another, less dense area at higher prices and larger lots. This reflects typical property characteristics within neighborhoods, with smaller, more affordable homes dominating, while a smaller segment features larger, pricier properties. The visualization aids in understanding real estate trends and informs property investment decisions based on lot size and expected sale price ranges.
ggplot(ames_housing_top, aes(x = LotArea, y = SalePrice, color = Neighborhood)) +
geom_point(alpha = 0.6) +
geom_density_2d_filled(contour_var = "ndensity", aes(fill = after_stat(level))) + # after_stat() replaces the deprecated ..level.. notation
scale_color_manual(values = my_colors) +
scale_fill_manual(values = my_colors) +
labs(title = "2D Density Map of Sale Prices and Lot Area by Top 10 Neighborhoods",
x = "Lot Area", y = "Sale Price") +
theme_minimal()
Mapping the Top 10 Neighborhoods by Median Sale Price
The R code generates an interactive Leaflet map of median sale prices in selected Ames, Iowa neighborhoods. Markers are color-coded: red for neighborhoods with a median sale price above $200,000 and blue for the rest. The map adds a polyline and a polygon to highlight the three highest-priced neighborhoods and includes interactive features such as zoom controls and a layers control, helping stakeholders explore the real estate market.
median_prices <- ames_housing %>%
group_by(Neighborhood) %>%
summarize(MedianSalePrice = median(SalePrice, na.rm = TRUE), .groups = 'drop')
neighborhoods_from_plot <- c("Blmngtn", "ClearCr", "CollgCr", "Crawfor", "NoRidge",
"NridgHt", "Somrst", "StoneBr", "Timber", "Veenker")
filtered_data <- median_prices %>%
filter(Neighborhood %in% neighborhoods_from_plot)
# Manually inputting coordinates for the Ames neighborhoods
neighborhood_coords <- data.frame(
Neighborhood = neighborhoods_from_plot,
Latitude = c(42.05905, 41.6668, 42.02109528, 42.020579, 42.05055618,
42.05963516, 41.6449, 42.06128, 41.72098, 42.02369),
Longitude = c(-93.63793, -93.6668, -93.68562317, -95.3811884, -93.62717438,
-93.65499878, -91.48731, -93.63313, -91.47446, -93.64669)
)
full_data <- merge(neighborhood_coords, filtered_data, by = "Neighborhood")
pal <- colorNumeric(palette = "viridis", domain = full_data$MedianSalePrice)
final_map <- leaflet(full_data) %>%
addTiles() %>%
setView(lng = -93.6250, lat = 42.0308, zoom = 12)
final_map <- final_map %>%
  addAwesomeMarkers(
    ~Longitude, ~Latitude,
    icon = makeAwesomeIcon(
      icon = 'home',
      markerColor = ~ifelse(MedianSalePrice > 200000, 'red', 'blue')
    ),
    popup = ~paste(Neighborhood, "<br> Median Sale Price: $", format(MedianSalePrice, big.mark=",", scientific=FALSE)),
    group = "Price Markers"  # must match a name in addLayersControl() for the toggle to work
  )
top_neighborhoods <- full_data %>%
  top_n(3, MedianSalePrice) %>%
  arrange(desc(MedianSalePrice))
final_map <- final_map %>%
  addPolylines(
    lng = top_neighborhoods$Longitude,
    lat = top_neighborhoods$Latitude,
    color = "red",
    weight = 5,
    opacity = 0.7,
    group = "Top Priced Route"  # must match a name in addLayersControl()
  )
final_map <- final_map %>%
addPolygons(
lng = c(min(top_neighborhoods$Longitude) - 0.01, max(top_neighborhoods$Longitude) + 0.01,
max(top_neighborhoods$Longitude) + 0.01, min(top_neighborhoods$Longitude) - 0.01),
lat = c(min(top_neighborhoods$Latitude) - 0.01, min(top_neighborhoods$Latitude) - 0.01,
max(top_neighborhoods$Latitude) + 0.01, max(top_neighborhoods$Latitude) + 0.01),
fillColor = "#ff7800",
fillOpacity = 0.5,
weight = 3,
color = "orange",
opacity = 0.8
)
final_map <- final_map %>%
addLayersControl(
overlayGroups = c("Price Markers", "Top Priced Route"),
options = layersControlOptions(collapsed = FALSE)
)
final_map
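Because the coordinates are typed in by hand, a quick range check can catch entries that land outside the city before they reach the map. The sketch below is a minimal example under an assumed bounding box around Ames: the `lat_range` and `lng_range` limits are illustrative choices, not values from the dataset. It re-enters the coordinate values from the table above and lists the rows that fall outside the box.

```r
# Coordinate values as entered in the map code above.
coords_check <- data.frame(
  Neighborhood = c("Blmngtn", "ClearCr", "CollgCr", "Crawfor", "NoRidge",
                   "NridgHt", "Somerst", "StoneBr", "Timber", "Veenker"),
  Latitude  = c(42.05905, 41.6668, 42.02109528, 42.020579, 42.05055618,
                42.05963516, 41.6449, 42.06128, 41.72098, 42.02369),
  Longitude = c(-93.63793, -93.6668, -93.68562317, -95.3811884, -93.62717438,
                -93.65499878, -91.48731, -93.63313, -91.47446, -93.64669),
  stringsAsFactors = FALSE
)

# Assumed bounding box around Ames (illustrative limits, adjust as needed).
lat_range <- c(41.95, 42.10)
lng_range <- c(-93.75, -93.55)

# Flag rows whose latitude or longitude falls outside the box.
outside <- with(coords_check,
                Latitude  < lat_range[1] | Latitude  > lat_range[2] |
                Longitude < lng_range[1] | Longitude > lng_range[2])
suspect_coords <- coords_check[outside, ]
print(suspect_coords)
```

Any row printed here would be drawn far from the `setView()` center and would stretch the polyline and polygon across the state, so it is worth correcting such rows before interpreting the map.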
How do market dynamics influence the sale price of a house?
Market dynamics shape house sale prices, as the 2006 to 2010 sales in this dataset show. The line graphs and heatmap below trace how average sale prices move with broader economic conditions and with the time of year. A sharp drop in 2007, followed by partial recoveries and further swings, points to market volatility that may be linked to the global financial crisis. Month-to-month peaks and troughs suggest seasonal patterns or changes in market activity, while the smoothed trend declines from 2008 onwards, which is consistent with an economic downturn. Because each monthly average rests on a limited number of sales, short-term swings should be read with caution. Understanding these dynamics supports informed decisions about property investments and market strategies.
Average Sale Price Over Time by Year
This line graph shows average home sale prices from 2006 to 2010, with each year depicted by a different colored line. Fluctuations in prices, like the sharp decline in 2007 followed by a recovery in 2008, hint at market volatility possibly linked to economic events such as the global financial crisis. The subsequent ups and downs in 2009 and 2010 indicate continued market instability or other economic pressures influencing home values. Such insights are vital for investors and policymakers navigating real estate markets.
# || is the scalar "or", which is what an if() condition expects
if (!"YrSold" %in% names(ames_housing) || !"MoSold" %in% names(ames_housing)) {
  stop("YrSold and/or MoSold columns are missing")
}
ames_housing$DateSold <- as.Date(paste(ames_housing$YrSold, ames_housing$MoSold, "01", sep = "-"), format = "%Y-%m-%d")
ames_housing$Year <- factor(ames_housing$YrSold)
daily_avg_prices <- ames_housing %>%
group_by(DateSold, Year) %>%
summarize(AveragePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop')
color_palette <- colorRampPalette(brewer.pal(9, "Set1"))(length(unique(ames_housing$Year)))
ggplot(daily_avg_prices, aes(x = DateSold, y = AveragePrice, group = Year, color = Year)) +
geom_line(linewidth = 1.5, alpha = 0.8) +
scale_color_manual(values = color_palette) +
labs(title = "Average Sale Prices Over Time by Year",
x = "Date Sold", y = "Average Sale Price") +
theme_minimal() +
theme(
legend.position = "bottom",
legend.title = element_text(size = 14, face = "bold"),
legend.text = element_text(size = 12),
axis.text = element_text(size = 12),
axis.title = element_text(size = 14)
)
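To put numbers on the year-to-year movements that the line graph shows, yearly averages and their percentage change can be tabulated directly. The following is a minimal base R sketch on a small made-up data frame: `toy_sales` and its prices are hypothetical, chosen only to make the arithmetic easy to follow. With the real data, `ames_housing` would take its place.

```r
# Hypothetical sales, two per year (not taken from the Ames data).
toy_sales <- data.frame(
  YrSold    = c(2006, 2006, 2007, 2007, 2008, 2008),
  MoSold    = c(2, 7, 3, 9, 5, 11),
  SalePrice = c(200000, 180000, 150000, 170000, 165000, 187000)
)

# Same date construction as in the chunk above: first day of the sale month.
toy_sales$DateSold <- as.Date(
  paste(toy_sales$YrSold, toy_sales$MoSold, "01", sep = "-"),
  format = "%Y-%m-%d"
)

# Average price per year, then percentage change from the previous year.
yearly <- aggregate(SalePrice ~ YrSold, data = toy_sales, FUN = mean)
names(yearly)[2] <- "AveragePrice"
yearly$PctChange <- c(NA, 100 * diff(yearly$AveragePrice) /
                            head(yearly$AveragePrice, -1))
print(yearly)
```

The first year has no predecessor, so its change is `NA`. Reading the change as a percentage makes it easier to judge whether a dip in the plot is a large move or ordinary noise.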
Monthly Average Sales Price
This line graph tracks monthly average home sale prices from 2006 to 2010, with a solid blue line depicting month-to-month fluctuations and a dashed red line showing the overall trend. Peaks and troughs in the blue line hint at seasonal trends or market volatility, while the downward trend in the red line post-2008 may reflect broader economic challenges impacting property values. This visualization provides insight into how external economic conditions and seasonal factors influence real estate dynamics.
ames_housing$MonthYear <- as.Date(paste(ames_housing$YrSold, ames_housing$MoSold, "01", sep = "-"), "%Y-%m-%d")
monthly_prices <- ames_housing %>%
group_by(MonthYear) %>%
summarize(AveragePrice = mean(SalePrice, na.rm = TRUE))
ggplot(monthly_prices, aes(x = MonthYear, y = AveragePrice)) +
geom_line(color = "dodgerblue", linewidth = 1.2) +
geom_smooth(method = "loess", se = FALSE, color = "red", linetype = "dashed") +
labs(title = "Monthly Average Sale Prices",
x = "Month-Year", y = "Average Sale Price") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "none"
)
## `geom_smooth()` using formula = 'y ~ x'
Heatmap of Average Sales Price by Month and Year
This heatmap visualizes average sale prices from 2006 to 2010, with warmer colors indicating higher prices and cooler colors representing lower prices. Patterns of warmer hues suggest spikes in prices, possibly due to increased market activity, while cooler tones indicate downturns. This visualization offers insights into how real estate prices fluctuate over time, reflecting market dynamics and trends.
monthly_prices$Year <- year(monthly_prices$MonthYear)
monthly_prices$Month <- factor(month(monthly_prices$MonthYear, label = TRUE), levels = month.abb)
ggplot(monthly_prices, aes(x = Month, y = Year, fill = AveragePrice)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkred") +
labs(title = "Heatmap of Average Sale Prices by Month and Year",
x = "Month", y = "Year") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.title = element_text(size = 14),
legend.text = element_text(size = 12),
panel.grid = element_blank()
)
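The numbers behind a heatmap like this one can also be inspected as a plain year-by-month table. Below is a minimal sketch using base R's `tapply()` on made-up values: `toy_monthly` is hypothetical and stands in for `monthly_prices`. Combinations with no sales come out as `NA`, which correspond to the empty tiles in a heatmap.

```r
# Hypothetical monthly averages (not taken from the Ames data).
toy_monthly <- data.frame(
  Year         = c(2006, 2006, 2007, 2007),
  Month        = factor(c("Jan", "Jun", "Jan", "Dec"), levels = month.abb),
  AveragePrice = c(150000, 190000, 160000, 170000)
)

# Rows are years, columns are the twelve months in calendar order.
price_grid <- tapply(toy_monthly$AveragePrice,
                     list(toy_monthly$Year, toy_monthly$Month),
                     mean)

# Show the grid in thousands of dollars for readability.
print(round(price_grid / 1000))
```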
How do seasonal trends affect the Sale Price of Houses in Ames, Iowa?
Season of sale has a visible but modest association with house prices in Ames, Iowa. The boxplot, density plot, and violin plot below compare sale price distributions across four seasons derived from the month of sale (MoSold), and a summary table reports the mean, median, variance, and standard deviation for each season. Summer has the highest average and median sale price ($187,248 and $171,000) and Spring the lowest ($174,271 and $156,750), with Winter and Fall in between. The gap of roughly $13,000 between the highest and lowest seasonal averages is small next to the within-season standard deviations of about $71,000 to $91,000, so home characteristics and location account for far more of the price variation than the season does. A final bar chart breaks median sale prices down by neighborhood and season and shows that prices vary across both. These patterns remain useful context for home buyers, sellers, and real estate professionals timing a transaction in the Ames housing market.
Seasonal Trends in Sales Price
This boxplot compares sale prices across four seasons, each defined as a calendar quarter of MoSold: Winter (January to March, green), Spring (April to June, orange), Summer (July to September, blue), and Fall (October to December, pink). Summer has the highest median price and Spring the lowest, with Fall slightly below Summer, as the summary statistics later in this section confirm. The wide spread of prices within each season reflects differences in home features and neighborhood desirability rather than the season alone.
ames_housing$Season <- factor(
cut(ames_housing$MoSold, breaks = c(0, 3, 6, 9, 12), labels = c("Winter", "Spring", "Summer", "Fall")),
levels = c("Winter", "Spring", "Summer", "Fall")
)
ggplot(ames_housing, aes(x = Season, y = SalePrice, fill = Season)) +
geom_boxplot(outlier.shape = NA, alpha = 0.8) +
geom_jitter(width = 0.2, size = 2, alpha = 0.5, color = "black") +
scale_fill_brewer(palette = "Set2") +
labs(title = "Seasonal Trends in Sale Prices", x = "Season", y = "Sale Price") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "none"
)
Density of Sales Price of House by Season
The density plot overlays the sale price distribution of each season. In the summary statistics below, the mean exceeds the median in every season, which is consistent with right-skewed distributions that have a long tail of expensive homes. Summer has the highest median ($171,000) and Spring the lowest ($156,750), while Winter shows the widest spread (a standard deviation of about $90,700), which indicates that some high-priced homes also sell in the colder months.
ggplot(ames_housing, aes(x = SalePrice, fill = Season)) +
geom_density(alpha = 0.6) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Density of Sale Prices by Season",
x = "Sale Price", y = "Density") +
theme_minimal() +
theme(
legend.position = "top",
plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Violin Plots of Sales Price by Season
The violin plot shows the shape of the sale price distribution in each season, with a narrow boxplot inside each violin marking the median and interquartile range. Each violin is widest where sales are most concentrated and tapers toward the expensive tail. Medians are similar across seasons, with Summer highest and Spring lowest, and Winter has the largest variance, so the colder months are not limited to low-priced sales. The summary table that follows quantifies these differences.
ggplot(ames_housing, aes(x = Season, y = SalePrice, fill = Season)) +
geom_violin(trim = FALSE, alpha = 0.8) +
geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
labs(title = "Violin Plots of Sale Prices by Season",
x = "Season", y = "Sale Price") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "none"
)
seasonal_stats <- ames_housing %>%
group_by(Season) %>%
summarize(
Average = mean(SalePrice, na.rm = TRUE),
Median = median(SalePrice, na.rm = TRUE),
Variance = var(SalePrice, na.rm = TRUE),
SD = sd(SalePrice, na.rm = TRUE)
)
print(seasonal_stats)
## # A tibble: 4 × 5
## Season Average Median Variance SD
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Winter 181961. 165250 8229839609. 90718.
## 2 Spring 174271. 156750 5039986048. 70993.
## 3 Summer 187248. 171000 7285022935. 85352.
## 4 Fall 185773. 167500 5910101316. 76877.
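One way to judge how much these seasonal differences matter is to compare the gap between seasonal averages with the spread inside each season. The sketch below is a rough, illustrative calculation rather than a formal test: it re-enters the averages and standard deviations printed above (as rounded in that output) into a data frame named `seasonal_summary`, a name introduced here for the example.

```r
# Values copied from the printed table above (rounded as displayed).
seasonal_summary <- data.frame(
  Season  = c("Winter", "Spring", "Summer", "Fall"),
  Average = c(181961, 174271, 187248, 185773),
  SD      = c(90718, 70993, 85352, 76877)
)

# Gap between the highest and lowest seasonal average.
gap <- max(seasonal_summary$Average) - min(seasonal_summary$Average)

# A simple yardstick for within-season spread: the mean of the four SDs.
typical_sd <- mean(seasonal_summary$SD)

# Express the gap in units of that typical spread.
gap_in_sd <- gap / typical_sd

cat("Largest gap between seasonal averages:", gap, "\n")
cat("Gap relative to typical within-season SD:", round(gap_in_sd, 2), "\n")
```

The gap comes to about $13,000, or roughly 0.16 of a typical within-season standard deviation, so season alone separates prices only weakly. A formal comparison, for example with `kruskal.test()` from base R's stats package, would need the full `ames_housing` data rather than these summary figures.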
Seasonal Trends in Sale Prices by Neighborhood
The bar chart displays median sale prices across Ames neighborhoods by season. Here the seasons follow the meteorological convention rather than the calendar quarters used above: Winter (December to February), Spring (March to May), Summer (June to August), and Fall (September to November), drawn in the Set1 palette order of red, blue, green, and purple. Neighborhoods like NridgHt and NoRidge consistently command higher prices, while IDOTRR and MeadowV generally have lower median prices. Within a neighborhood, the median price also shifts from season to season, although some neighborhood and season combinations may contain only a few sales, so isolated peaks should be interpreted with care.
# Half-open bins [0,3), [3,6), [6,9), [9,12): taking MoSold %% 12 maps December to 0,
# so it joins January and February in Winter instead of falling outside the bins
ames_housing$Season <- cut(ames_housing$MoSold %% 12,
                           breaks = c(0, 3, 6, 9, 12),
                           labels = c("Winter", "Spring", "Summer", "Fall"),
                           right = FALSE)
median_prices_by_season <- ames_housing %>%
group_by(Neighborhood, Season) %>%
summarize(MedianSalePrice = median(SalePrice, na.rm = TRUE), .groups = 'drop')
ggplot(median_prices_by_season, aes(x = Neighborhood, y = MedianSalePrice, fill = Season)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Seasonal Trends in House Sale Prices by Neighborhood",
x = "Neighborhood",
y = "Median Sale Price") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set1")
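How `cut()` handles interval endpoints decides which months end up in which season, so any month-to-season mapping is worth checking against all twelve months. The sketch below is a small self-contained check with illustrative variable names. With `right = FALSE`, breaks of `c(1, 3, 6, 9, 12)` produce the bins [1,3), [3,6), [6,9), [9,12), which leaves month 12 without a season, while applying the bins [0,3) through [9,12) to the month modulo 12 places December with January and February.

```r
months <- 1:12
season_labels <- c("Winter", "Spring", "Summer", "Fall")

# Half-open bins starting at 1: December (12) falls outside the last bin.
half_open <- cut(months, breaks = c(1, 3, 6, 9, 12),
                 labels = season_labels, right = FALSE)

# Wrapping December to 0 first gives every month a season.
wrapped <- cut(months %% 12, breaks = c(0, 3, 6, 9, 12),
               labels = season_labels, right = FALSE)

print(data.frame(Month = month.abb, HalfOpen = half_open, Wrapped = wrapped))
```

An `NA` in a grouping column shows up as its own group in `group_by()` and as an extra gray series in a ggplot2 bar chart, which is a useful signal that some rows were not classified.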
How do quality and condition of a house impact Sale Price of Houses in Ames?
The visual analyses reveal a consistent relationship between house quality/condition and sale prices in Ames. Both boxplot and violin plot presentations show that higher quality ratings correspond to elevated median sale prices, indicating market preference for better-quality properties. However, variability within each rating suggests other factors influence prices. The scatter plot further highlights a positive correlation between quality ratings and prices, notably in homes rated 8-10, reflecting diverse buyer preferences. While condition also impacts pricing, its effect appears complex, emphasizing the need to consider quality and condition together when assessing property worth in Ames.
Boxplot of Sales Price by Quality and Condition of House
The boxplot indicates higher quality ratings correlate with increased median sale prices, evident from the upward shift in median lines across ratings 1 to 10. Variability within each rating, reflected in whisker lengths and box ranges, suggests other factors influence prices. Homes with the highest ratings show wide price ranges, indicative of diverse buyer perceptions and additional features’ influence. This visualization succinctly demonstrates quality’s impact on home value and variability within quality segments.
palette15 <- colorRampPalette(brewer.pal(9, "Set3"))(15)
ggplot(ames_housing, aes(x = as.factor(OverallQual), y = SalePrice, fill = as.factor(OverallCond))) +
geom_boxplot() +
scale_fill_manual(values = palette15) +
labs(title = "Boxplot of Sale Prices by Quality and Condition",
x = "Overall Quality", y = "Sale Price", fill = "Overall Condition") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "right"
)
Violin Plots of Sales Price by Overall Quality of House
The violin plots display sale price distributions across different overall quality levels, with higher levels correlating with higher median prices, notably levels 8, 9, and 10. Thicker sections indicate denser price clusters. Lower quality levels exhibit fewer data points and lower prices, while higher levels show broader distributions, reflecting both increased sales volume and price variability. This visualization succinctly captures the relationship between quality, sale prices, and price dispersion.
ggplot(ames_housing, aes(x = as.factor(OverallQual), y = SalePrice, fill = as.factor(OverallCond))) +
geom_violin(trim = FALSE, alpha = 0.8) +
scale_fill_manual(values = palette15) +
labs(title = "Violin Plots of Sale Prices by Quality and Condition",
x = "Overall Quality", y = "Sale Price",
fill = "Overall Condition") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "right"
)
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Scatter Plot of Sales Price by Overall Quality of House
The scatter plot illustrates the relationship between sale prices, overall quality, and condition ratings of houses. It demonstrates a positive correlation between higher quality ratings and sale prices, with the highest quality homes exhibiting a wider range of prices. While condition also influences prices, its impact is less pronounced. This visualization emphasizes the significance of quality in determining real estate values.
ggplot(ames_housing, aes(x = as.factor(OverallQual), y = SalePrice, color = as.factor(OverallCond))) +
geom_jitter(alpha = 0.6, shape = 16, width = 0.2) +
scale_color_manual(values = palette15) +
labs(title = "Scatter Plot of Sale Prices by Quality and Condition",
x = "Overall Quality", y = "Sale Price",
color = "Overall Condition") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
quality_condition_stats <- ames_housing %>%
group_by(OverallQual, OverallCond) %>%
summarize(
Count = n(),
Average = mean(SalePrice, na.rm = TRUE),
Median = median(SalePrice, na.rm = TRUE),
Variance = var(SalePrice, na.rm = TRUE),
SD = sd(SalePrice, na.rm = TRUE),
.groups = 'drop'
)
print(quality_condition_stats)
## # A tibble: 52 × 7
## OverallQual OverallCond Count Average Median Variance SD
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 61000 61000 NA NA
## 2 1 3 1 39300 39300 NA NA
## 3 2 3 2 47656. 47656. 304773360. 17458.
## 4 2 5 1 60000 60000 NA NA
## 5 3 2 2 80750 80750 36125000 6010.
## 6 3 3 3 69167. 67000 141583333. 11899.
## 7 3 4 6 91817. 91950 62821667. 7926.
## 8 3 5 2 117300 117300 994580000 31537.
## 9 3 6 5 69760 72500 710033000 26646.
## 10 3 7 1 120000 120000 NA NA
## # ℹ 42 more rows
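The `NA` entries in the Variance and SD columns are not missing data: the sample variance divides by n − 1, so it is undefined for a group with a single sale. A short base R illustration follows. The two-sale example uses 76,500 and 85,000, a pair chosen here because it reproduces the average (80,750) and variance (36,125,000) reported for the group with OverallQual 3 and OverallCond 2, not because these are known to be the actual sale prices.

```r
# One observation: variance and standard deviation are undefined (NA).
single_sale <- 61000
single_var  <- var(single_sale)
single_sd   <- sd(single_sale)

# Two observations 8,500 apart (illustrative values, see the note above).
two_sales <- c(76500, 85000)
two_mean  <- mean(two_sales)
two_var   <- var(two_sales)
two_sd    <- sd(two_sales)

print(c(single_var = single_var, single_sd = single_sd))
print(c(mean = two_mean, var = two_var, sd = two_sd))
```

Rows with a Count of 1 or 2 therefore say little about the spread of prices for that quality and condition combination, and are best read as individual sales.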
Overall Quality, Sales Volume, and Sale Price by Neighborhood
This bubble plot relates average overall quality, sales volume, and average sale price across neighborhoods in the Ames housing dataset. Each bubble's vertical position gives the neighborhood's average quality rating, its size gives the number of houses sold, and its color runs from blue (low average sale price) to red (high average sale price). High-quality, high-priced neighborhoods like “NridgHt,” “NoRidge,” and “StoneBr” appear as redder bubbles near the top of the plot, while neighborhoods such as “OldTown” and “Edwards” combine lower prices and quality with higher sales volume, which shows up as larger, bluer bubbles. This visualization aids in understanding neighborhood characteristics and can inform decisions for buyers, developers, and urban planners.
neighborhood_quality_stats <- ames_housing %>%
group_by(Neighborhood) %>%
summarize(
AvgQuality = mean(OverallQual, na.rm = TRUE),
HouseCount = n(),
AvgSalePrice = mean(SalePrice, na.rm = TRUE),
.groups = 'drop'
) %>%
arrange(desc(AvgQuality))
ggplot(neighborhood_quality_stats, aes(x = Neighborhood, y = AvgQuality, size = HouseCount, color = AvgSalePrice)) +
geom_point(alpha = 0.6) +
scale_color_gradient(low = "blue", high = "red") +
scale_size(range = c(3, 12), name = "House Count") +
labs(title = "Neighborhood Quality, Volume, and Value",
x = "Neighborhood",
y = "Average Quality",
color = "Average Sale Price",
size = "House Count") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "right"
)
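One detail worth knowing when reading or adapting this plot: ggplot2 places a character x variable in alphabetical order, so sorting the data frame with `arrange()` does not change the order of neighborhoods on the axis. Below is a minimal sketch of one common alternative, base R's `reorder()`, on made-up values: the `toy_quality` data frame and its neighborhood labels are hypothetical. In the plot, this would correspond to mapping `x = reorder(Neighborhood, -AvgQuality)`.

```r
# Hypothetical neighborhoods and average quality scores.
toy_quality <- data.frame(
  Neighborhood = c("NbhdA", "NbhdB", "NbhdC"),
  AvgQuality   = c(5.1, 8.3, 5.4),
  stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels are sorted by the second argument;
# negating the score puts the highest-quality neighborhood first.
ordered_nbhd <- reorder(toy_quality$Neighborhood, -toy_quality$AvgQuality)

print(levels(ordered_nbhd))
```

With the levels in this order, the x-axis would run from the highest to the lowest average quality, which makes the relationship between quality and color easier to see.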
What is the relationship between having a garage and the Sale Price of Houses in Ames?
The visual analyses provide insights into how garage characteristics influence house sale prices in Ames. The boxplot highlights higher median prices for homes with attached or built-in garages, especially accommodating three cars, indicating buyer preference. Conversely, homes without garages or with carports fetch lower prices. The violin and scatter plots reveal that garage capacity and type, particularly three and four-car setups, influence prices, reflecting buyer priorities for functionality and space. Overall, these visuals underscore the significant impact of garage features on property values, aligning with buyer preferences.
Impact of Garage on Sale Price of House
The boxplot shows how garage type and capacity relate to house sale prices in Ames. Homes with three-car garages command the highest prices when the garage is attached or built-in, with median sale prices of about $295,000 and $339,000 in the summary table below. Homes with four-car garages show lower medians than those, but only five such homes appear in the data, so this pattern is not reliable. Properties with no garage or with only a carport sell for the least, with medians of roughly $100,000 to $108,000. Red points mark mean sale prices.
ggplot(ames_housing, aes(x = as.factor(GarageCars), y = SalePrice, fill = GarageType)) +
geom_boxplot() +
stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
scale_fill_brewer(palette = "Set3") +
labs(title = "Impact of Garage Cars and Type on Sale Price",
x = "Number of Cars in Garage", y = "Sale Price", fill = "Garage Type") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "bottom"
)
Violin and Scatter Plot of Sale Price by GarageType
The violin plot with overlaid sale points shows how garage capacity and type relate to house sale prices. Homes without garages sell for the least (median $100,000). Prices generally rise with capacity: one-car garages sit at moderate prices, two-car garages are the most common configuration, and three-car garages command the highest prices when attached or built-in. Detached three-car garages are the exception, with a median of $124,000 across only nine homes. Four-car garages are rare (five homes in total), with group averages of about $168,000 to $206,000, well below attached and built-in three-car homes, so they do not support conclusions about a luxury segment.
ggplot(ames_housing, aes(x = as.factor(GarageCars), y = SalePrice, fill = GarageType)) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_jitter(width = 0.1, alpha = 0.5, color = "black", size = 2) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Violin and Scatter Plot of Sale Prices by Garage Cars and Type",
x = "Number of Cars in Garage", y = "Sale Price",
fill = "Garage Type") +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 12)
)
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
garage_stats <- ames_housing %>%
group_by(GarageType, GarageCars) %>%
summarize(
Count = n(),
Average = mean(SalePrice, na.rm = TRUE),
Median = median(SalePrice, na.rm = TRUE),
Variance = var(SalePrice, na.rm = TRUE),
SD = sd(SalePrice, na.rm = TRUE),
.groups = 'drop'
)
print(garage_stats)
## # A tibble: 19 × 7
## GarageType GarageCars Count Average Median Variance SD
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2Types 2 1 150000 150000 NA NA
## 2 2Types 3 4 147425 158000 1918455833. 43800.
## 3 2Types 4 1 168000 168000 NA NA
## 4 Attchd 1 171 136278. 135000 843755687. 29047.
## 5 Attchd 2 560 195987. 187000 2275119112. 47698.
## 6 Attchd 3 138 313434. 295246. 9378842708. 96844.
## 7 Attchd 4 1 206300 206300 NA NA
## 8 Basment 1 8 135156. 135750 943945312. 30724.
## 9 Basment 2 11 179054. 164000 5811995008. 76236.
## 10 BuiltIn 1 8 124188. 125000 626566964. 25031.
## 11 BuiltIn 2 50 215000. 214450 1819316461. 42653.
## 12 BuiltIn 3 30 355821. 339084 10133764941. 100667.
## 13 CarPort 1 3 118300 108000 1797670000 42399.
## 14 CarPort 2 6 105793. 105380. 189627800. 13771.
## 15 Detchd 1 179 120346. 119200 895131422. 29919.
## 16 Detchd 2 196 144064. 138500 1505180669. 38797.
## 17 Detchd 3 9 169544. 124000 15248667778. 123485.
## 18 Detchd 4 3 196326. 200000 5120870480. 71560.
## 19 None 0 81 103317. 100000 1076825760. 32815.
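Group sizes matter when comparing the rows of this table: several capacity and type combinations contain only one to four homes. As a rough cross-check, the sketch below re-enters the three-car and four-car rows printed above (averages as rounded in that output) and pools them by capacity with a count-weighted mean. The `garage_rows` name is introduced here for illustration.

```r
# Three-car and four-car rows copied from the printed table above.
garage_rows <- data.frame(
  GarageType = c("2Types", "Attchd", "BuiltIn", "Detchd",
                 "2Types", "Attchd", "Detchd"),
  GarageCars = c(3, 3, 3, 3, 4, 4, 4),
  Count      = c(4, 138, 30, 9, 1, 1, 3),
  Average    = c(147425, 313434, 355821, 169544, 168000, 206300, 196326)
)

# For each capacity: total homes and the average price weighted by group size.
by_cars <- sapply(split(garage_rows, garage_rows$GarageCars), function(g) {
  c(Homes = sum(g$Count),
    WeightedMean = weighted.mean(g$Average, g$Count))
})

print(round(by_cars))
```

The pooled figures rest on 181 three-car homes but only 5 four-car homes, so the four-car average can shift substantially with a single additional sale.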
Garage Type by Neighborhood
The bar graph shows the number of homes by neighborhood and garage type in the Ames dataset. Because every home is counted once, the “None” category counts homes without a garage rather than garages. Neighborhoods like “NAmes” and “OldTown” feature diverse garage options, with attached, built-in, and detached types. Attached garages are the most common type overall, led by NAmes (148 homes) and CollgCr (122 homes), whereas detached garages are the largest group in OldTown (81 homes) and BrkSide (44 homes). Conversely, “Blmngtn” and “BrDale” show limited garage variety, with few attached garages and no other types, possibly due to newer urban planning or space constraints. This visualization helps real estate professionals and homebuyers assess which garage options are typical in each neighborhood.
garage_by_neighborhood <- ames_housing %>%
group_by(Neighborhood, GarageType) %>%
summarize(
TotalGarages = n(),
AverageCars = mean(GarageCars, na.rm = TRUE),
.groups = 'drop'
) %>%
arrange(desc(TotalGarages))
print(garage_by_neighborhood)
## # A tibble: 93 × 4
## Neighborhood GarageType TotalGarages AverageCars
## <chr> <chr> <int> <dbl>
## 1 NAmes Attchd 148 1.47
## 2 CollgCr Attchd 122 2.07
## 3 OldTown Detchd 81 1.63
## 4 NWAmes Attchd 69 2.01
## 5 NAmes Detchd 60 1.8
## 6 Somerst Attchd 60 2.37
## 7 NridgHt Attchd 59 2.63
## 8 Gilbert Attchd 54 2.07
## 9 SawyerW Attchd 45 2.02
## 10 BrkSide Detchd 44 1.41
## # ℹ 83 more rows
ggplot(garage_by_neighborhood, aes(x = Neighborhood, y = TotalGarages, fill = GarageType)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Garage Availability by Neighborhood and Type",
x = "Neighborhood",
y = "Total Garages",
fill = "Garage Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_fill_brewer(palette = "Set1")
Further Pre-processing and Feature Engineering
Sale Price by Years Since Remodel
The graph illustrates the sale prices of houses segmented by the age since their last remodel and their overall quality rating. The x-axis groups homes by the number of years between the last remodel (or construction, when no remodel year is recorded) and the sale, ranging from newly remodeled to over 20 years. The y-axis represents the sale price, and the violin outlines are color-coded according to the 10-point overall quality scale, where 1 represents the lowest quality and 10 the highest.
From the graph, it is evident that newly remodeled homes generally achieve higher sale prices, with a noticeable peak in price for those in the highest quality categories. The presence of high-price spikes in the newly remodeled category across multiple quality ratings underscores the value added by recent renovations. Interestingly, even homes remodeled 6-10 and 11-15 years ago in the highest quality ratings (9 and 10) exhibit some high sale price points, suggesting that exceptional quality can sustain higher property values even as the remodel ages.
In contrast, as the remodel age increases beyond 15 years, the maximum sale prices tend to decrease, particularly evident in the “16-20 years” and “Over 20 years” categories. However, homes in these older remodel categories that maintain a high quality rating (8-10) still occasionally reach higher sale prices, indicating that quality remains a significant determinant of price irrespective of the age of the remodel.
Moreover, the graph highlights significant price variability within each remodeling age category, especially among homes with mid-range quality ratings (4-7). This variability suggests that factors beyond remodel age and quality, possibly including location, size, or specific home features, are influencing sale prices.
Overall, this visualization effectively demonstrates how recent remodeling and high-quality ratings can drive up home sale prices, while also revealing the sustained value of well-maintained properties even as they age. This insight is crucial for both sellers considering the value of undertaking renovations and buyers evaluating the long-term value of their investments.
ames_housing$AgeSinceRemodel <- ifelse(
is.na(ames_housing$YearRemodAdd),
ames_housing$YrSold - ames_housing$YearBuilt,
ames_housing$YrSold - ames_housing$YearRemodAdd
)
ames_housing$AgeCategory <- cut(
ames_housing$AgeSinceRemodel,
breaks = c(-Inf, 0, 5, 10, 15, 20, Inf),
labels = c("Newly remodeled", "1-5 years", "6-10 years", "11-15 years", "16-20 years", "Over 20 years"),
include.lowest = TRUE
)
ggplot(ames_housing, aes(x = AgeCategory, y = SalePrice, fill = AgeCategory, color = as.factor(OverallQual))) +
geom_violin(trim = FALSE) +
labs(title = "Sale Price of Houses by Age Since Remodel and Overall Quality",
x = "Age Since Remodel Category",
y = "Sale Price",
color = "Overall Quality") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "right")
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
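The binning behind `AgeCategory` can be checked on a few hand-picked ages. The sketch below applies the same breaks and labels as the code above to an illustrative vector, `sample_ages`, that includes the boundary values. It shows that the intervals are closed on the right, so an age of exactly 5 counts as "1-5 years", and that a negative age, should one occur when the recorded remodel year is later than the sale year, lands in "Newly remodeled".

```r
# Illustrative ages, including bin boundaries and one negative value.
sample_ages <- c(-1, 0, 1, 5, 6, 20, 21, 60)

# Same breaks and labels as the AgeCategory definition above.
sample_category <- cut(
  sample_ages,
  breaks = c(-Inf, 0, 5, 10, 15, 20, Inf),
  labels = c("Newly remodeled", "1-5 years", "6-10 years",
             "11-15 years", "16-20 years", "Over 20 years"),
  include.lowest = TRUE
)

print(data.frame(Age = sample_ages, Category = sample_category))
```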
Sale Price in the Top 5 Neighborhoods by Years Since Remodel
This graph shows how remodel age and overall quality relate to sale prices within the five Ames neighborhoods with the most sales: CollgCr, Edwards, NAmes, OldTown, and Somerset. Each panel represents one neighborhood and plots sale price against the years since the house was last remodeled, grouped into six categories from newly remodeled to over 20 years, with overall quality ratings from 4 to 10 distinguished by outline color. The panels use independent y-axes, so price levels should be compared across neighborhoods using the axis labels rather than the heights of the violins.
A clear pattern emerges across all neighborhoods, showing that houses that have been recently remodeled typically command higher prices. This trend is particularly pronounced in the Somerset neighborhood, where a wide price dispersion among newly remodeled homes suggests significant differences in house size, features, or possibly the extent of the renovations undertaken. High-quality ratings (9 and 10) consistently correlate with higher sale prices across different remodeling age categories, indicating a strong market preference for superior quality homes.
Each neighborhood displays unique pricing characteristics, likely influenced by local market conditions and demographic factors. For example, OldTown generally shows lower sale prices across all categories compared to the more upscale neighborhoods like CollgCr and Somerset. This might reflect differences in neighborhood desirability, local amenities, or the historical value of the properties.
Additionally, a general decline in prices is observed as the age since last remodel increases. This is evident in neighborhoods like Edwards and NAmes, where older remodels are associated with lower house prices, underscoring the market’s preference for recent updates. This decline also points to the depreciation of home features and the potential need for newer updates to attract buyers.
Overall, the graph effectively encapsulates how recent renovations, coupled with high quality, enhance home values, while also illustrating significant variances in how these factors play out across different neighborhoods, thus reflecting the complex dynamics of the local real estate market.
top_neighborhoods <- ames_housing %>%
group_by(Neighborhood) %>%
summarise(Count = n(), .groups = 'drop') %>%
arrange(desc(Count)) %>%
slice_max(Count, n = 5) %>%
pull(Neighborhood)
filtered_data <- ames_housing %>%
filter(Neighborhood %in% top_neighborhoods)
ggplot(filtered_data, aes(x = AgeCategory, y = SalePrice, fill = AgeCategory, color = as.factor(OverallQual))) +
geom_violin(trim = FALSE) +
facet_wrap(~Neighborhood, scales = "free_y") +
labs(title = "Sale Price of Houses by Age Since Remodel and Overall Quality in Top 5 Neighborhoods",
x = "Age Since Remodel Category",
y = "Sale Price",
color = "Overall Quality") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.x = element_text(size = 8, face = "bold"),
legend.position = "right")
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Sale Price of Houses Based on Year Built and House Style across Neighborhoods
The graph shows the average sale prices of new houses, defined in the code as houses built within five years of their sale, across various neighborhoods in Ames, Iowa, differentiated by house type and overall quality. Each subplot corresponds to a different quality rating (from 4 to 10), showing how the average sale prices vary according to both the type of house and the neighborhood.
Starting with the quality rating of 4, which is depicted only for Edwards, the average sale price is markedly lower, hinting at a potentially less desirable location or less appealing house features in this particular category. As we progress to higher quality ratings (5 through 6), a broader range of neighborhoods and house types are represented, showing a general trend of increasing average sale prices with improvements in overall quality. Notably, the diversity in house types (such as 1.5 Finished, 1 Story, 2 Story, Split Foyer, and Split Level) suggests varied buyer preferences and lifestyle needs, which in turn affect sale prices.
At quality ratings of 7 and 8, the graph shows a more competitive price range across neighborhoods like Blmngtn, CollgCr, and Somerst. This middle range of quality indicates robust demand and potentially balanced offerings in terms of home features and neighborhood desirability. Interestingly, the price variance between house types is less pronounced within these quality ratings, implying that quality may weigh more heavily than house type in buyer decisions at this level.
The subplots for higher qualities (9 and 10) show a pronounced increase in sale prices, with neighborhoods like NridgHt, StoneBr, and Timber featuring prominently. These areas likely offer superior amenities or advantageous locations, which, when combined with high-quality homes, command premium prices. Notably, at the highest quality rating of 10, only a few neighborhoods are represented, highlighting exclusivity and possibly limited availability of top-tier homes.
Overall, this visualization clearly demonstrates the interplay between house type, neighborhood, and overall quality in determining the sale prices of new homes in Ames. The data suggests that while quality consistently drives prices up, neighborhood selection and house type also play critical roles in shaping market values, catering to a range of preferences and financial capabilities among potential buyers. This detailed breakdown serves as a valuable tool for understanding how various factors contribute to housing market dynamics in the region.
ames_housing$IsNew <- ifelse(
ames_housing$YearBuilt >= (ames_housing$YrSold - 5) & !is.na(ames_housing$YearBuilt),
1,
ifelse(
!is.na(ames_housing$YearBuilt),
0,
NA
)
)
new_houses_prices_type <- ames_housing %>%
filter(IsNew == 1) %>%
group_by(Neighborhood, HouseStyle, OverallQual) %>%
summarise(AverageSalePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
arrange(Neighborhood, desc(AverageSalePrice))
neighborhoods_with_new_houses <- new_houses_prices_type %>%
filter(AverageSalePrice > 0) %>%
pull(Neighborhood)
ggplot(new_houses_prices_type, aes(x = Neighborhood, y = AverageSalePrice, fill = HouseStyle)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~OverallQual, scales = "free", labeller = label_both) +
labs(title = "Average Sale Price of New Houses by Neighborhood, House Type, and Overall Quality",
x = "Neighborhood",
y = "Average Sale Price") +
scale_fill_brewer(palette = "Set2", name = "House Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.x = element_text(size = 8, face = "bold")) +
guides(fill = guide_legend(title = "House Type"))
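Each bar in this chart is an average over however many new houses fall into that neighborhood, house style, and quality combination, and some bars may rest on a single sale. The sketch below shows one way to carry the group size alongside the average so that thinly supported bars can be identified. It runs on a made-up data frame (`toy_new`), and `n_houses` is a hypothetical column name added for illustration.

```r
library(dplyr)

# Toy stand-in for the filtered new-house data (illustrative values only)
toy_new <- data.frame(
  Neighborhood = c("CollgCr", "CollgCr", "CollgCr", "Somerst", "Somerst", "Edwards"),
  HouseStyle   = c("1Story", "1Story", "2Story", "2Story", "2Story", "1Story"),
  OverallQual  = c(7, 7, 7, 8, 8, 4),
  SalePrice    = c(200000, 220000, 240000, 260000, 300000, 90000)
)

# Same grouping as the report, with the number of houses behind each average
avg_with_counts <- toy_new %>%
  group_by(Neighborhood, HouseStyle, OverallQual) %>%
  summarise(
    AverageSalePrice = mean(SalePrice, na.rm = TRUE),
    n_houses = n(),
    .groups = "drop"
  )

# Bars that summarise only one sale
single_sale_bars <- avg_with_counts %>%
  filter(n_houses == 1)

print(avg_with_counts)
```

An average based on one house is just that house's price, so conclusions about a neighborhood drawn from such bars deserve extra caution.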
Sale Price of Houses by Renovation Status across Neighborhoods
The graph shows the relationship between renovation status, overall quality, and average sale prices of houses across neighborhoods in Ames, where a house counts as renovated when its remodel year (YearRemodAdd) is later than its construction year (YearBuilt). Renovated houses generally command higher prices than their non-renovated counterparts across quality levels and neighborhoods. This trend supports the general market perception that renovations enhance value and resale price.
Notably, the graph breaks down these dynamics across a spectrum of quality ratings from 1 to 10, revealing that the impact of renovations is especially significant in higher-quality homes. For example, in quality ratings 9 and 10, renovated properties in affluent neighborhoods like StoneBr and NridgHt achieve sale prices that are markedly higher than those of non-renovated properties, sometimes by hundreds of thousands of dollars. This suggests a strong buyer preference for turnkey properties in premium locations, where the perceived value added through high-end renovations is greatest.
Conversely, the graph indicates a plateau in the renovation impact within lower-quality segments (ratings 1 to 4). In these categories, even substantial renovations yield only modest increases in sale prices, particularly in less desirable neighborhoods such as Edwards and BrkSide. This could reflect a limit on the market’s willingness to pay premium prices for properties in areas with lower overall appeal, regardless of the improvements made.
Further, the graph illustrates variability in the effect of renovations across different neighborhoods. For instance, while renovated homes in middle-tier neighborhoods like CollgCr and Gilbert see significant price boosts, the same renovations in neighborhoods like IDOTRR and SWISU result in comparatively smaller price differences. This highlights the importance of location as a determinant of renovation ROI, indicating that the same investment in different areas can yield vastly different returns based on local market conditions and buyer preferences.
Overall, the detailed analysis provided by the graph offers crucial insights for homeowners and real estate investors. It suggests that while renovations generally increase property values, the scale of this increase is heavily influenced by the property’s baseline quality and its neighborhood context. Thus, strategic consideration of where and how to invest in renovations can significantly affect the financial outcome of such endeavors in the real estate market.
ames_housing$WasRenovated <- ifelse(
!is.na(ames_housing$YearRemodAdd) & !is.na(ames_housing$YearBuilt),
ifelse(
ames_housing$YearRemodAdd > ames_housing$YearBuilt,
1,
0
),
NA
)
sale_prices_by_reno_status <- ames_housing %>%
group_by(Neighborhood, WasRenovated, OverallQual) %>%
summarise(AverageSalePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
arrange(Neighborhood, desc(AverageSalePrice))
ggplot(sale_prices_by_reno_status, aes(x = Neighborhood, y = AverageSalePrice, fill = as.factor(WasRenovated))) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~OverallQual, scales = "free", labeller = label_both) +
scale_fill_manual(values = c("0" = "red", "1" = "green"), labels = c("0" = "Not Renovated", "1" = "Renovated")) +
labs(title = "Average Sale Price of Houses by Renovation Status, Overall Quality, and Neighborhood",
x = "Neighborhood",
y = "Average Sale Price",
fill = "Renovation Status") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.x = element_text(size = 8, face = "bold")) +
guides(fill = guide_legend(title = "Renovation Status"))
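The price differences discussed for this chart can also be expressed as a number per neighborhood and quality level. The following sketch, assuming a summary table shaped like `sale_prices_by_reno_status`, spreads the two renovation groups into columns and subtracts them. The data frame `toy_reno` holds made-up values, and the names `reno_0`, `reno_1`, and `RenovationPremium` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Toy summary in the same shape as sale_prices_by_reno_status (illustrative values)
toy_reno <- data.frame(
  Neighborhood     = c("CollgCr", "CollgCr", "Edwards", "Edwards", "NAmes"),
  OverallQual      = c(7, 7, 5, 5, 6),
  WasRenovated     = c(0, 1, 0, 1, 1),
  AverageSalePrice = c(200000, 215000, 120000, 118000, 150000)
)

# One row per neighborhood and quality level, one column per renovation status
reno_premium <- toy_reno %>%
  pivot_wider(
    names_from   = WasRenovated,
    values_from  = AverageSalePrice,
    names_prefix = "reno_"
  ) %>%
  # Positive values mean renovated houses sold for more on average;
  # NA means one of the two groups has no sales to compare against
  mutate(RenovationPremium = reno_1 - reno_0)

print(reno_premium)
```

Rows with an `NA` premium are combinations where only renovated or only non-renovated houses were sold, which the bar chart cannot show as a comparison either.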
Sale Price of Houses by House Size in the Top 10 Neighborhoods
The graph shows the average sale prices of houses in the ten neighborhoods of Ames, Iowa, with the highest average sale price, categorized by house size (medium and large) and overall quality rating (ranging from 4 to 10). Across neighborhoods, higher quality ratings are consistently associated with higher average sale prices, underscoring the impact of property condition and amenities on market value. Large houses also generally command higher prices than their medium-sized counterparts, particularly at higher quality ratings (7 through 10), which suggests a strong market preference for more spacious homes in conjunction with higher quality.
Looking at individual neighborhoods, premium areas like NridgHt, StoneBr, and Somerst stand out in the highest quality segment (OverallQual: 10). Here, large homes reach the peak of the market in terms of sale prices, indicating that these areas are highly sought after, likely due to their location, community amenities, or other desirable attributes that complement the high quality and larger size of the homes. This contrast in price points across neighborhoods, especially at the highest quality level, highlights the interplay between neighborhood desirability, house size, and quality, where each factor amplifies the others.
For instance, while neighborhoods like CollgCr and ClearCr also feature in multiple quality brackets, the premium attached to large, high-quality homes is most pronounced in the most affluent areas, suggesting a tiered market where top-tier buyers have distinct preferences that sharply drive up prices. On the other hand, at lower quality ratings (4 to 6), while there remains a noticeable difference in prices between house sizes, the gap is relatively smaller and less influenced by neighborhood, indicating a more uniform valuation approach that focuses more on basic house attributes rather than premium features or specific neighborhood allure.
This detailed analysis illuminates how real estate values in Ames are shaped by a complex array of factors including the intrinsic attributes of the homes (size and quality) and the extrinsic appeal of their neighborhoods. For investors and homebuyers, understanding these dynamics can guide more informed decisions, pinpointing where the best value or potential for appreciation might lie based on the synergistic effects of quality, size, and location in the local housing market.
ames_housing$TotalSF <- ifelse(
!is.na(ames_housing$X1stFlrSF) & !is.na(ames_housing$X2ndFlrSF) & !is.na(ames_housing$TotalBsmtSF),
ames_housing$X1stFlrSF + ames_housing$X2ndFlrSF + ames_housing$TotalBsmtSF,
NA
)
small_threshold <- 1000
medium_threshold <- 2500
ames_housing$HouseAreaCategory <- cut(ames_housing$TotalSF,
breaks = c(0, small_threshold, medium_threshold, Inf),
labels = c("Small", "Medium", "Large"),
include.lowest = TRUE)
overall_neighborhood_avg_price <- ames_housing %>%
group_by(Neighborhood) %>%
summarise(AverageSalePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
arrange(desc(AverageSalePrice)) %>%
slice_head(n = 10)
top_neighborhoods_data <- ames_housing %>%
filter(Neighborhood %in% overall_neighborhood_avg_price$Neighborhood)
neighborhood_size_avg_price <- top_neighborhoods_data %>%
group_by(Neighborhood, HouseAreaCategory, OverallQual) %>%
summarise(AverageSalePrice = mean(SalePrice, na.rm = TRUE), .groups = 'drop') %>%
arrange(Neighborhood, desc(AverageSalePrice))
p <- ggplot(neighborhood_size_avg_price, aes(x = reorder(Neighborhood, -AverageSalePrice), y = AverageSalePrice, fill = HouseAreaCategory)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~OverallQual, scales = "free", labeller = label_both) +
labs(title = "Average Sale Price by House Size and Overall Quality in Top 10 Neighborhoods",
x = "Neighborhood",
y = "Average Sale Price",
fill = "House Size") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.x = element_text(size = 8, face = "bold"),
legend.position = "right")
p_plotly <- ggplotly(p) %>%
layout(title = "Average Sale Price by House Size and Overall Quality in Top 10 Neighborhoods",
xaxis = list(title = "Neighborhood"),
yaxis = list(title = "Average Sale Price"),
legend = list(title = list(text = "House Size")),
hovermode = "closest")
p_plotly
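The size categories above depend on how `cut()` treats values that fall exactly on a threshold. The short example below uses a hypothetical vector of square footages to show the behavior of the same breaks and labels: intervals are closed on the right by default, so a house of exactly 1,000 sq. ft. is "Small" and one of exactly 2,500 sq. ft. is "Medium".

```r
# Hypothetical total square footages, chosen to sit on and around the thresholds
total_sf <- c(800, 1000, 1001, 2500, 2501)

size_category <- cut(
  total_sf,
  breaks = c(0, 1000, 2500, Inf),
  labels = c("Small", "Medium", "Large"),
  include.lowest = TRUE  # also places a value of exactly 0 in "Small"
)

# Pair each value with its assigned category
data.frame(total_sf, size_category)
# 800 -> Small, 1000 -> Small, 1001 -> Medium, 2500 -> Medium, 2501 -> Large
```

Passing `right = FALSE` would instead close the intervals on the left, moving the boundary values into the next category up.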
Model 2, which examines the relationship between Sale Price and Overall Quality, stands out as the most effective of the four evaluated models for predicting house prices. This model shows an exceptionally strong association between overall quality and sale prices, as evidenced by its extremely low p-value (2.19e-313) and a high test statistic (49.36366). The coefficient of $45,435.80 for each unit increase in quality confirms that higher quality is associated with a substantially higher market value, a conclusion that aligns well with typical market expectations.
In addition, Model 2’s Root Mean Square Error (RMSE) of 48,589.45, while substantial, is the lowest among the models tested, suggesting that it explains the variance in sale prices more accurately than the others. This comparative precision in predicting sale prices, along with the model’s strong alignment with real estate market dynamics—where quality is a crucial determinant of property value—makes it particularly useful for both theoretical analysis and practical applications in the real estate sector. Hence, Model 2 not only offers superior statistical validity but also provides actionable insights that reflect common trends and behaviors in the housing market, making it the most reliable tool for understanding and predicting the impacts of property quality on sale prices.
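The RMSE comparison described above can be collected into a single table rather than read off four separate plots. The sketch below shows the idea under stated assumptions: it uses R's built-in `mtcars` data as a stand-in for `ames_housing`, and the helper `rmse()` and the list `models` are names introduced here, not part of the original analysis.

```r
# In-sample RMSE of a fitted linear model
rmse <- function(model) sqrt(mean(residuals(model)^2))

# Four single-predictor models on a stand-in dataset (mtcars, not the Ames data)
models <- list(
  wt   = lm(mpg ~ wt,   data = mtcars),
  hp   = lm(mpg ~ hp,   data = mtcars),
  disp = lm(mpg ~ disp, data = mtcars),
  qsec = lm(mpg ~ qsec, data = mtcars)
)

# One row per model, sorted so the best-fitting model comes first
rmse_table <- data.frame(
  model = names(models),
  rmse  = sapply(models, rmse)
)
rmse_table <- rmse_table[order(rmse_table$rmse), ]

print(rmse_table)
```

With the report's own objects, the list would hold `m1_cond`, `m2_qual`, `m3_garage`, and `m4_living`. Note that this is RMSE on the training data, so it measures fit rather than out-of-sample accuracy.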
Model 1: Sale Price vs OverallCond
The graph “Sale Price vs. Overall Condition” visualizes the relationship between the overall condition of houses and their sale prices, highlighting an unexpected trend. Unlike what one might anticipate, the regression line, which is nearly flat with a slight negative slope, suggests that higher overall condition ratings do not correspond to higher sale prices. This is counterintuitive as better condition is typically expected to enhance a home’s value.
The regression analysis provides further details on this relationship. The intercept is approximately $211,909.59, the model's predicted sale price at an overall condition score of zero; because the rating scale does not include zero, this value only anchors the regression line and has no practical interpretation. More importantly, the coefficient for Overall Condition is -$5,558.12, indicating that, on average, each unit increase in the overall condition rating is associated with a decrease in sale price of this amount. This negative relationship is statistically significant, with a p-value of 0.0029, indicating that it is unlikely to have occurred by chance.
Additionally, the RMSE (Root Mean Square Error) value of $79,174.24 reflects a high degree of variability in sale prices that the model based on overall condition alone does not capture. This suggests other factors might play a significant role in determining the sale prices of houses, overshadowing the impact of their overall condition.
This analysis potentially points to market dynamics where buyers might not value incremental improvements in condition as highly as expected, or where other attributes of a property—such as location, size, or modernity—may be driving prices more significantly. The relatively high RMSE also suggests a model incorporating more variables might better explain the variance in sale prices.
In summary, while the model shows a statistically significant negative impact of overall condition on sale prices, the practical interpretation and the high RMSE underscore the complexity of real estate valuation, where multiple factors interact in determining a property’s market value. This serves as an important consideration for sellers and buyers in the real estate market, suggesting that enhancements in condition alone might not always correspond to expected increases in property value.
p <- ggplot(ames_housing, aes(OverallCond, SalePrice)) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
scale_y_continuous(labels = comma) +
theme_minimal() +
labs(title="Sale Price vs. Overall Condition",
x="Overall Condition",
y="Sale Price ($)")
m1_cond <- lm(SalePrice ~ OverallCond, data = ames_housing)
predictions <- predict(m1_cond, ames_housing)
residuals <- ames_housing$SalePrice - predictions
rmse <- sqrt(mean(residuals^2))
p + labs(subtitle = paste("RMSE:", round(rmse, 2)))
## `geom_smooth()` using formula = 'y ~ x'
tidy(m1_cond)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 211910. 10597. 20.0 8.27e-79
## 2 OverallCond -5558. 1864. -2.98 2.91e- 3
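The `statistic` and `p.value` columns in the output above follow directly from the estimate and its standard error. As a worked check, the sketch below recomputes both for the OverallCond row from the rounded values shown in the table. It assumes all 1,460 observations were used in the fit, which gives 1,458 residual degrees of freedom; that count is an assumption, not something the output states.

```r
# Rounded values copied from the tidy() output for OverallCond
estimate  <- -5558
std_error <- 1864

# Assumption: the model was fitted on all 1,460 rows (two coefficients estimated)
n_obs <- 1460
df    <- n_obs - 2

# t statistic: the estimate measured in units of its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

round(t_stat, 2)
signif(p_value, 3)
```

The results should agree with the table's -2.98 and 2.91e-3 up to the rounding of the inputs. The same arithmetic applies to the coefficient rows of the other three models.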
Model 2: Sale Price vs OverallQual
The graph titled “Sale Price vs. Overall Quality” displays a strong positive correlation between the overall quality of houses and their sale prices, with an added statistical annotation of the Root Mean Square Error (RMSE) of 48,589.45. This RMSE value quantifies the average magnitude of the errors between the predicted sale prices by the model and the actual sale prices, suggesting that the model has a moderate degree of prediction error.
The regression analysis, detailed in the summary statistics provided, strongly supports the visual trend observed in the graph. The intercept of the regression line is approximately -$96,206.08. In theory this is the expected sale price of a house with an overall quality score of zero, but in practice it only sets the starting point of the regression line, since the rating scale starts above zero. The coefficient for Overall Quality is $45,435.80, indicating that each one-point increase in the overall quality rating is associated with an average increase in sale price of approximately $45,436. This coefficient is highly statistically significant, as evidenced by an extremely small p-value (close to 0), which makes it very unlikely that the effect arose by chance.
The large value of the statistic (49.36366) further confirms the robustness of this relationship, implying a very strong influence of overall quality on the sale price. This statistical strength, combined with the practical interpretation of the slope, underscores the critical role that quality plays in the housing market, where higher quality not only commands higher prices but does so in a predictably substantial manner.
While the model demonstrates a significant and strong relationship between quality and price, the RMSE of 48,589.45 also indicates that the model doesn’t capture all variability in the sale prices. This spread, visible as the vertical dispersion of points around the regression line, especially at higher quality ratings, suggests other influencing factors such as location, size, or specific amenities, which might also impact the sale prices but are not accounted for in this single-variable model.
In summary, the analysis clearly demonstrates that improving the overall quality of a house is likely to result in a significant increase in its sale price, although with a quantifiable uncertainty as indicated by the RMSE. This insight is crucial for both buyers, who may be willing to pay a premium for higher quality, and sellers or developers, who might consider quality enhancements as a profitable investment in the property market.
p <- ggplot(ames_housing, aes(OverallQual, SalePrice)) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
scale_y_continuous(labels = comma) +
theme_minimal() +
labs(title="Sale Price vs. Overall Quality",
x="Overall Quality",
y="Sale Price ($)")
m2_qual <- lm(SalePrice ~ OverallQual, data = ames_housing)
predictions <- predict(m2_qual, ames_housing)
residuals <- ames_housing$SalePrice - predictions
rmse <- sqrt(mean(residuals^2))
p + labs(subtitle = paste("RMSE:", round(rmse, 2)))
## `geom_smooth()` using formula = 'y ~ x'
tidy(m2_qual)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -96206. 5756. -16.7 1.67e- 57
## 2 OverallQual 45436. 920. 49.4 2.19e-313
Model 3: Sale Price vs Garage Area
The graph titled “Sale Price vs. Garage Area” illustrates the relationship between the garage area of houses (measured in square feet) and their sale prices. A regression line, depicted in blue, indicates a positive correlation, suggesting that larger garage areas are generally associated with higher house sale prices. This correlation is further quantified in the regression analysis results provided.
The intercept of the regression model is approximately $71,357.42, suggesting that the base price of a house, absent consideration of garage area (i.e., when the garage area is zero), would be estimated at this value. The slope coefficient for the garage area is $231.65. This indicates that for each additional square foot of garage area, the sale price of the house is expected to increase by about $231.65. This relationship is statistically significant, with a p-value effectively at zero (5.265038e-158), reinforcing the strong influence of garage area on house pricing. The statistic of 30.44587 supports the robustness of this relationship.
The RMSE (Root Mean Square Error) of $62,093.07, however, highlights substantial variability in the sale prices that is not captured by the garage area alone. This suggests that while the garage area significantly impacts the sale price, other factors such as location, overall house size, amenities, and property condition also play crucial roles in determining the final sale price. The variability is visually represented by the scatter of data points around the regression line, indicating that while there’s a general trend of increasing prices with larger garages, the spread of prices at each level of garage area is considerable.
In summary, the analysis underscores the importance of garage area in home valuation, which could be particularly relevant for buyers looking for properties with ample garage space or sellers considering renovations that include garage expansions. However, the high RMSE also calls for a cautious interpretation, suggesting that stakeholders should consider multiple property features alongside garage size when assessing house values.
p <- ggplot(ames_housing, aes(GarageArea, SalePrice)) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
scale_y_continuous(labels = comma) +
theme_minimal() +
labs(title="Sale Price vs. Garage Area",
x="Garage Area (sq. ft.)",
y="Sale Price ($)")
m3_garage <- lm(SalePrice ~ GarageArea, data = ames_housing)
predictions <- predict(m3_garage, ames_housing)
residuals <- ames_housing$SalePrice - predictions
rmse <- sqrt(mean(residuals^2))
p + labs(subtitle = paste("RMSE:", round(rmse, 2)))
## `geom_smooth()` using formula = 'y ~ x'
tidy(m3_garage)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 71357. 3949. 18.1 5.11e- 66
## 2 GarageArea 232. 7.61 30.4 5.27e-158
Model 4: Sale Price vs Living Area
The graph “Sale Price vs. Living Area” illustrates a positive correlation between the living area of houses (in square feet) and their sale prices, indicated by a rising blue regression line. The Root Mean Square Error (RMSE) is reported as 56,034.3, which signifies the average deviation of the observed sale prices from those predicted by the model, highlighting substantial variability in house prices that cannot be solely explained by living area.
The regression analysis presents a more detailed quantitative relationship: the intercept, calculated at approximately $18,569.03, represents the theoretical sale price for a house with no living area, serving as a baseline figure in the model. More significantly, the slope coefficient for the living area is about $107.13, indicating that for every additional square foot of living area, the sale price of a house increases by this amount on average. This relationship is strongly supported by statistical evidence, shown by the very small p-value (4.518034e-223), which strongly rejects the null hypothesis of no effect of living area on sale price. The statistic of 38.348207 further emphasizes the strength of this relationship.
Despite this clear positive trend, the high RMSE suggests that other factors play a critical role in determining sale prices, such as location, construction quality, age of the property, and market conditions, which are not captured by living area alone. The scatter of points around the regression line, particularly at higher living areas, indicates that while larger homes generally fetch higher prices, the extent of this price increase can vary widely depending on these additional factors.
In conclusion, the analysis robustly demonstrates the significant impact of living area on house pricing, affirming that larger homes typically command higher prices. However, the variability underscored by the RMSE and the scatter around the regression line also calls for considering a broader range of property attributes when evaluating or predicting house prices beyond just the living area. This insight is particularly valuable for stakeholders in the real estate market, including buyers, sellers, and developers, when assessing property values or making investment decisions.
p <- ggplot(ames_housing, aes(GrLivArea, SalePrice)) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
scale_y_continuous(labels = comma) +
theme_minimal() +
labs(title="Sale Price vs. Living Area",
x="Living Area (sq. ft.)",
y="Sale Price ($)")
m4_living <- lm(SalePrice ~ GrLivArea, data = ames_housing)
predictions <- predict(m4_living, ames_housing)
residuals <- ames_housing$SalePrice - predictions
rmse <- sqrt(mean(residuals^2))
p + labs(subtitle = paste("RMSE:", round(rmse, 2)))
## `geom_smooth()` using formula = 'y ~ x'
tidy(m4_living)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 18569. 4481. 4.14 3.61e- 5
## 2 GrLivArea 107. 2.79 38.3 4.52e-223
Predicting Sale Prices of Properties (Houses) described by Test Dataset
m1_cond <- lm(SalePrice ~ OverallCond, data = ames_housing)
m2_qual <- lm(SalePrice ~ OverallQual, data = ames_housing)
m3_garage <- lm(SalePrice ~ GarageArea, data = ames_housing)
m4_living <- lm(SalePrice ~ GrLivArea, data = ames_housing)
ameshous_test_data <- ameshous_test_data %>%
mutate(
Pred_SalePrice_Cond = predict(m1_cond, newdata = ameshous_test_data),
Pred_SalePrice_Qual = predict(m2_qual, newdata = ameshous_test_data),
Pred_SalePrice_Garage = predict(m3_garage, newdata = ameshous_test_data),
Pred_SalePrice_Living = predict(m4_living, newdata = ameshous_test_data)
)
ameshous_test_data %>%
select(OverallCond, OverallQual, GarageArea, GrLivArea,
Pred_SalePrice_Cond, Pred_SalePrice_Qual,
Pred_SalePrice_Garage, Pred_SalePrice_Living) %>%
head()
## OverallCond OverallQual GarageArea GrLivArea Pred_SalePrice_Cond
## 1 6 5 730 896 178560.9
## 2 6 6 312 1329 178560.9
## 3 5 5 482 1629 184119.0
## 4 6 6 470 1604 178560.9
## 5 5 8 506 1280 184119.0
## 6 5 6 440 1655 184119.0
## Pred_SalePrice_Qual Pred_SalePrice_Garage Pred_SalePrice_Living
## 1 130972.9 240458.7 114557.8
## 2 176408.7 143630.9 160945.3
## 3 130972.9 183010.6 193084.4
## 4 176408.7 180230.9 190406.1
## 5 267280.3 188570.1 155695.9
## 6 176408.7 173281.5 195869.8
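As a check on how these predictions arise, the Living Area prediction for the first row can be reproduced by hand. The sketch below assumes the intercept (18569.026) and slope (107.130) reported for the Living Area model in the model summary table; the variable names are illustrative only.

```r
# Coefficients of SalePrice ~ GrLivArea as reported in the model summary
# (rounded to three decimals, so the result matches only approximately).
intercept <- 18569.026
slope <- 107.130

# Living area (sq. ft.) of the first test-set property shown above
gr_liv_area <- 896

# A simple linear model predicts intercept + slope * predictor
pred_by_hand <- intercept + slope * gr_liv_area
print(pred_by_hand)
```

The result agrees with the tabulated Pred_SalePrice_Living for row 1 to within the rounding of the coefficients.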
Visualising the Predicted Sale Prices of Properties in Test Dataset
The “Predicting Sale Prices of properties using Developed Models in Testdata” graph compares the predictions of the four single-predictor models: Condition, Quality, Garage Area, and Living Area. The R-squared values reported with it come from regressing each model's predicted prices on OverallQual, not from comparing predictions with actual sale prices, because the test data contain no SalePrice column. The Quality model's R-squared of 1.00 (blue triangles) is therefore true by construction: its predictions are an exact linear function of OverallQual. It is neither evidence of exceptional accuracy nor a sign of overfitting.
On the same measure, the Garage Area and Living Area models have R-squared values of 0.3228 and 0.3120, respectively. About a third of the variation in their predictions is linearly associated with overall quality. These models, shown in green and as purple crosses, produce a wider spread of predicted prices.
The Condition model, with an R-squared of just 0.0092, produces predictions that are almost unrelated to overall quality. Its predictions, shown in red, also span the narrowest range, so condition alone separates properties very little.
Given these observations, Quality is the "best" model here only in the sense that its predictions track OverallQual perfectly. Ranking the models by predictive accuracy requires data with known sale prices, such as a held-out part of the training set. On the training data, the R-squared values reported under Model Assessment (0.626 for Quality, 0.502 for Living Area, 0.389 for Garage Area, 0.006 for Condition) are the appropriate basis for comparison.
plot_data <- ameshous_test_data %>%
select(OverallCond, GarageArea, GrLivArea, OverallQual, Pred_SalePrice_Cond, Pred_SalePrice_Qual, Pred_SalePrice_Garage, Pred_SalePrice_Living) %>%
pivot_longer(cols = starts_with("Pred"), names_to = "Model", values_to = "PredictedPrice")
plot_data$Model <- factor(plot_data$Model, levels = c("Pred_SalePrice_Cond", "Pred_SalePrice_Qual", "Pred_SalePrice_Garage", "Pred_SalePrice_Living"),
labels = c("Condition", "Quality", "Garage Area", "Living Area"))
ggplot(plot_data, aes(x = Model, y = PredictedPrice, color = Model, shape = Model)) +
geom_point(alpha = 0.6, size = 3) +
scale_color_brewer(palette = "Set1") +
geom_smooth(method = "lm", se = FALSE, aes(group = Model), linetype = "dashed") +
labs(title = "Predicting Sale Prices of properties using Developed Models in Testdata",
x = "Models",
y = "Predicted Sale Price",
color = "Model",
shape = "Model") +
theme_minimal() +
theme(legend.position = "bottom") +
guides(color = guide_legend(override.aes = list(size = 5)),
shape = guide_legend(override.aes = list(size = 5)))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Duplicated `override.aes` is ignored.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
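# R-squared of each model's predictions regressed on OverallQual
# (this is not a comparison with actual sale prices, which the test data lack)
rsquared <- plot_data %>%
  group_by(Model) %>%
  summarise(Rsquared = summary(lm(PredictedPrice ~ OverallQual))$r.squared, .groups = "keep")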
cat("R-squared values for each model:\n")
## R-squared values for each model:
print(rsquared)
## # A tibble: 4 × 2
## # Groups: Model [4]
## Model Rsquared
## <fct> <dbl>
## 1 Condition 0.00919
## 2 Quality 1
## 3 Garage Area 0.323
## 4 Living Area 0.312
cat("\n")
best_model <- rsquared$Model[which.max(rsquared$Rsquared)]
# as.character() prints the factor label rather than its integer code
cat("Best model (Highest R-squared):", as.character(best_model), "\n")
## Best model (Highest R-squared): Quality
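The R-squared of 1 for the Quality model follows from how it is computed, and this can be shown on simulated data. The sketch below is illustrative only: the data frames `train` and `test` and all values are made up, and for a single predictor the squared correlation is used as the R-squared.

```r
set.seed(1)

# Simulated stand-in for the training data (values are invented)
train <- data.frame(OverallQual = sample(1:10, 200, replace = TRUE))
train$SalePrice <- 50000 + 30000 * train$OverallQual + rnorm(200, sd = 40000)

fit <- lm(SalePrice ~ OverallQual, data = train)

# Stand-in "test" data without a SalePrice column, as in the report
test <- data.frame(OverallQual = sample(1:10, 100, replace = TRUE))
test$Pred <- predict(fit, newdata = test)

# R-squared of the predictions regressed on the predictor that generated them
r2_pred <- cor(test$Pred, test$OverallQual)^2

# R-squared of the model against observed prices in the training data
r2_train <- summary(fit)$r.squared

print(c(predictions_vs_predictor = r2_pred, training_fit = r2_train))
```

The first value is 1 up to floating-point error whatever the noise level, while the second reflects how well the predictor actually explains price.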
Evaluating the model
Model Assessment
The four single-predictor models fitted to the training data differ clearly in how well they explain sale prices in the Ames housing dataset. The “Sale Price ~ Overall Quality” model is the strongest, with an R-squared of 0.626: about 62.6% of the variability in sale prices is explained by overall quality. It also has the lowest AIC and BIC of the four. The “Sale Price ~ Living Area” model follows with an R-squared of 0.502, consistent with larger living spaces commanding higher prices. The “Sale Price ~ Garage Area” model is moderate at 0.389, and the “Sale Price ~ Overall Condition” model explains almost nothing, with an R-squared of just 0.006. Quality and living space are therefore the attributes to prioritise when building predictive models for real estate valuation. Note that the code and table that follow refit each model on its own test-set predictions rather than on observed prices. Those fits are exact by construction, which is why the table reports an R-squared of 1.000 and an RMSE of 0.00 for every model while reproducing the training coefficients.
# Refit each model on its own predicted sale prices (e.g., Pred_SalePrice_Cond).
# Each response is an exact linear function of the predictor, so R2 = 1 and RMSE = 0.
m1_cond_pred <- lm(Pred_SalePrice_Cond ~ OverallCond, data = ameshous_test_data)
m2_qual_pred <- lm(Pred_SalePrice_Qual ~ OverallQual, data = ameshous_test_data)
m3_garage_pred <- lm(Pred_SalePrice_Garage ~ GarageArea, data = ameshous_test_data)
m4_living_pred <- lm(Pred_SalePrice_Living ~ GrLivArea, data = ameshous_test_data)
# Store the models in a list
models_pred <- list(
"Sale Price ~ Overall Condition" = m1_cond_pred,
"Sale Price ~ Overall Quality" = m2_qual_pred,
"Sale Price ~ Garage Area" = m3_garage_pred,
"Sale Price ~ Living Area" = m4_living_pred
)
# Summarize the models
modelsummary(models_pred)
| | Sale Price ~ Overall Condition | Sale Price ~ Overall Quality | Sale Price ~ Garage Area | Sale Price ~ Living Area |
|---|---|---|---|---|
| (Intercept) | 211909.592 | -96206.080 | 71357.421 | 18569.026 |
| | (0.000) | (0.000) | (0.000) | (0.000) |
| OverallCond | -5558.115 | | | |
| | (0.000) | | | |
| OverallQual | | 45435.803 | | |
| | | (0.000) | | |
| GarageArea | | | 231.646 | |
| | | | (0.000) | |
| GrLivArea | | | | 107.130 |
| | | | | (0.000) |
| Num.Obs. | 1459 | 1459 | 1458 | 1459 |
| R2 | 1.000 | 1.000 | 1.000 | 1.000 |
| R2 Adj. | 1.000 | 1.000 | 1.000 | 1.000 |
| AIC | -52680.2 | -53542.2 | -56034.1 | -59655.0 |
| BIC | -52664.3 | -53526.3 | -56018.2 | -59639.1 |
| Log.Lik. | 26343.092 | 26774.085 | 28020.048 | 29830.492 |
| F | 4.59e+27 | 9.22e+29 | 3.1e+30 | 3.86e+31 |
| RMSE | 0.00 | 0.00 | 0.00 | 0.00 |
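Because the test data carry no SalePrice, out-of-sample accuracy cannot be measured on them. One common alternative is to hold out part of the training data and score predictions against the known prices. The following is a minimal sketch on simulated data, not the report's own procedure: the `homes` data frame, its values, and the 80/20 split are all assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in for the training data (values are invented)
n <- 500
homes <- data.frame(GrLivArea = round(runif(n, 600, 3000)))
homes$SalePrice <- 20000 + 100 * homes$GrLivArea + rnorm(n, sd = 50000)

# Hold out 20% of the rows for validation
idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- homes[idx, ]
valid <- homes[-idx, ]

# Fit on the training part only, then predict the held-out part
fit <- lm(SalePrice ~ GrLivArea, data = train)
pred <- predict(fit, newdata = valid)

# Out-of-sample error measured against known prices
rmse_valid <- sqrt(mean((valid$SalePrice - pred)^2))
r2_valid <- 1 - sum((valid$SalePrice - pred)^2) /
  sum((valid$SalePrice - mean(valid$SalePrice))^2)

print(c(RMSE = rmse_valid, R2 = r2_valid))
```

Applied to each of the four models, this would give validation RMSE and R-squared values that can be compared directly.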
Model Diagnostics- Performing Residual Diagnostics
Among the four models, the Quality model is the most effective on the training data, with the highest R-squared, although heteroscedasticity in its residuals still needs to be addressed. The Garage Area and Living Area models are moderately effective, with clear paths for improvement. The Condition model is the least effective and may require a substantial reevaluation or a different modeling approach altogether. One caveat applies to the code in this section: it refits each model on its own predicted sale prices, so the residuals it examines are floating-point noise (the RMSE values printed below are on the order of 1e-9). The diagnostics are informative only when run on the models fitted to observed prices in ames_housing.
In conclusion, while each model has its strengths, each also shows specific diagnostic problems that must be addressed to improve predictive accuracy and reliability. With appropriate techniques, such as transformations or robust methods, these models can be refined to give more dependable insights into the Ames real estate market.
Model 1: Sale Price vs OverallCond
The diagnostic plots provide a comprehensive assessment of the regression model predicting sale prices based on the overall condition of houses in the Ames dataset. The residuals versus fitted values plot highlights potential issues with linearity and homoscedasticity, as the residuals do not scatter randomly around zero, indicating possible non-linearity or unequal variance in the data. The histogram of residuals suggests non-normality in their distribution, which can affect the reliability of regression estimates. The ‘Posterior Predictive Check’ plot indicates a mismatch between observed and predicted data densities, while the ‘Homogeneity of Variance’ plot reveals heteroscedasticity, confirmed by a significant test (p < .001). Outliers and influential observations are evident in the ‘Normality of Residuals’ and ‘Influential Observations’ plots, respectively, potentially skewing model results. Addressing these issues through variable transformations, robust regression methods, or alternative modeling approaches is essential for enhancing the model’s accuracy and validity, ensuring more reliable insights into the impact of house condition on sale prices in the Ames dataset.
names(ameshous_test_data)
## [1] "Id" "MSSubClass" "MSZoning"
## [4] "LotFrontage" "LotArea" "Street"
## [7] "Alley" "LotShape" "LandContour"
## [10] "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2"
## [16] "BldgType" "HouseStyle" "OverallQual"
## [19] "OverallCond" "YearBuilt" "YearRemodAdd"
## [22] "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea"
## [28] "ExterQual" "ExterCond" "Foundation"
## [31] "BsmtQual" "BsmtCond" "BsmtExposure"
## [34] "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF"
## [40] "Heating" "HeatingQC" "CentralAir"
## [43] "Electrical" "X1stFlrSF" "X2ndFlrSF"
## [46] "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath"
## [52] "BedroomAbvGr" "KitchenAbvGr" "KitchenQual"
## [55] "TotRmsAbvGrd" "Functional" "Fireplaces"
## [58] "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea"
## [64] "GarageQual" "GarageCond" "PavedDrive"
## [67] "WoodDeckSF" "OpenPorchSF" "EnclosedPorch"
## [70] "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature"
## [76] "MiscVal" "MoSold" "YrSold"
## [79] "SaleType" "SaleCondition" "Pred_SalePrice_Cond"
## [82] "Pred_SalePrice_Qual" "Pred_SalePrice_Garage" "Pred_SalePrice_Living"
library(see)
# Fit the model using the predicted sale prices (e.g., Pred_SalePrice_Cond)
m1_cond_pred <- lm(Pred_SalePrice_Cond ~ OverallCond, data = ameshous_test_data)
# Augment the model with residuals and fitted values
m1_aug_cond_pred <- augment(m1_cond_pred)
# Residual vs. Fitted Plot
ggplot(data = m1_aug_cond_pred, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals") +
theme_minimal()
# Histogram of Residuals
ggplot(data = m1_aug_cond_pred, aes(x = .resid)) +
geom_histogram(color = 'red', fill = 'skyblue', bins = 30) +
xlab("Residuals") +
theme_minimal()
# Model diagnostics: check for heteroscedasticity
check_model(m1_cond_pred)
check_heteroscedasticity(m1_cond_pred)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
# Compute RMSE (Root Mean Squared Error) based on residuals
rmse_m1_cond_pred <- sqrt(mean(m1_aug_cond_pred$.resid^2))
print(paste("RMSE:", rmse_m1_cond_pred))
## [1] "RMSE: 3.57125872431811e-09"
Model 2: Sale Price vs Overall Quality
The diagnostic plots offer a comprehensive assessment of the regression model’s performance in predicting sale prices based on the overall quality of houses in the Ames dataset. While the model demonstrates reasonable linearity between residuals and fitted values, suggesting a linear relationship, concerns arise regarding heteroscedasticity, as evidenced by non-constant variance in the residuals across the range of fitted values. Additionally, the histogram of residuals indicates some deviation from normality, potentially impacting the reliability of statistical inferences derived from the model. Despite these concerns, the analysis reveals limited influence from outliers, suggesting that extreme data points do not unduly affect the model’s fit. Overall, while the model exhibits strengths in linearity and robustness to outliers, addressing issues such as heteroscedasticity and normality of residuals is essential to enhance the model’s reliability and ensure more accurate predictions of house prices based on overall quality.
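# Fit the model on the predicted sale prices (Pred_SalePrice_Qual ~ OverallQual).
# Note: this reassigns m2_qual, which previously held the fit to the training data.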
m2_qual <- lm(Pred_SalePrice_Qual ~ OverallQual, data = ameshous_test_data)
# Augment the model with residuals and fitted values
m2_aug_qual <- augment(m2_qual)
# Residual vs. Fitted Plot
ggplot(data = m2_aug_qual, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals") +
theme_minimal()
# Histogram of Residuals
ggplot(data = m2_aug_qual, aes(x = .resid)) +
geom_histogram(color = 'red', fill = 'skyblue', bins = 30) +
xlab("Residuals") +
theme_minimal()
# Model diagnostics: check for heteroscedasticity
check_model(m2_qual)
check_heteroscedasticity(m2_qual)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
# Compute RMSE (Root Mean Squared Error) based on residuals
rmse_m2_qual <- sqrt(mean(m2_aug_qual$.resid^2))
print(paste("RMSE:", rmse_m2_qual))
## [1] "RMSE: 2.69945907690479e-09"
Model 3: Sale Price vs GarageArea
The diagnostic plots for the model relating garage area to sale price in the Ames housing dataset reveal several issues affecting the model's fit and reliability. The residuals vs. fitted values plot indicates reasonable linearity, but the spread of residuals increases with higher fitted values, indicating heteroscedasticity and therefore a violation of the homogeneity of variance assumption. The histogram of residuals shows some skewness, a deviation from normality that could affect the validity of statistical inferences. Influential observations do not appear to affect the fit substantially. In summary, the model captures a linear relationship without problematic outliers, but heteroscedasticity and non-normal residuals must be addressed before it can be relied on to predict sale prices from garage area.
m3_garage <- lm(Pred_SalePrice_Garage ~ GarageArea, data = ameshous_test_data)
m3_aug_garage <- augment(m3_garage)
ggplot(data = m3_aug_garage, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals") +
theme_minimal()
ggplot(data = m3_aug_garage, aes(x = .resid)) +
geom_histogram(color = 'red', fill = 'skyblue', bins = 30) +
xlab("Residuals") +
theme_minimal()
check_model(m3_garage)
check_heteroscedasticity(m3_garage)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
rmse_m3_garage <- sqrt(mean(m3_aug_garage$.resid^2))
print(paste("RMSE:", rmse_m3_garage))
## [1] "RMSE: 1.09785798498115e-09"
Model 4: Sale Price vs GrLivArea (Living Area)
The diagnostic plots from the regression model evaluating the relationship between living area (GrLivArea) and sale price in the Ames housing dataset provide valuable insights into the model’s performance. While the residuals vs. fitted values plot displays a scatter of residuals around the zero line, indicating some level of linearity, noticeable patterns and outliers suggest potential issues with both linearity and homogeneity of variance. The histogram of residuals reveals a roughly symmetrical distribution with slight skewness, indicating minor deviations from normality that could affect the reliability of regression coefficients. Further diagnostic checks confirm the presence of heteroscedasticity, violating a fundamental assumption of OLS regression, and reveal deviations from normality, particularly at extreme values. Although influential observations are mostly within acceptable bounds, these findings collectively suggest that while the model captures a general linear trend, it may not provide the most reliable estimates without adjustments or the application of robust statistical techniques to address these issues. Therefore, further refinements are necessary before considering the model suitable for predictive purposes.
m4_living <- lm(Pred_SalePrice_Living ~ GrLivArea, data = ameshous_test_data)
m4_aug_living <- augment(m4_living)
ggplot(data = m4_aug_living, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals") +
theme_minimal()
ggplot(data = m4_aug_living, aes(x = .resid)) +
geom_histogram(color = 'red', fill = 'skyblue', bins = 30) +
xlab("Residuals") +
theme_minimal()
check_model(m4_living)
check_heteroscedasticity(m4_living)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
rmse_m4_living <- sqrt(mean(m4_aug_living$.resid^2))
print(paste("RMSE:", rmse_m4_living))
## [1] "RMSE: 3.32972431761919e-10"
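The four diagnostic blocks above repeat the same steps for each model. A small helper can make that repetition explicit and is easier to apply to the training fits. The sketch below uses base R and simulated data; `residual_summary()` is a hypothetical helper, not part of any package, and the `homes` data are invented.

```r
set.seed(7)

# Simulated stand-in for the training data (values are invented)
n <- 300
homes <- data.frame(
  GrLivArea  = runif(n, 600, 3000),
  GarageArea = runif(n, 0, 900)
)
homes$SalePrice <- 30000 + 90 * homes$GrLivArea + 60 * homes$GarageArea +
  rnorm(n, sd = 40000)

# Hypothetical helper: fitted values, residuals, and RMSE for any lm fit
residual_summary <- function(model) {
  res <- residuals(model)
  list(
    data = data.frame(.fitted = fitted(model), .resid = res),
    rmse = sqrt(mean(res^2))
  )
}

# Models fitted to observed prices, so the residuals are meaningful
models <- list(
  living = lm(SalePrice ~ GrLivArea, data = homes),
  garage = lm(SalePrice ~ GarageArea, data = homes)
)

rmse_by_model <- vapply(models, function(m) residual_summary(m)$rmse, numeric(1))
print(round(rmse_by_model))
```

The `data` element has the same `.fitted` and `.resid` columns used in the plots above, so it can be passed to the same ggplot calls.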
Additional Diagnostics in Models
The studentized Breusch-Pagan test checks the constant-variance assumption that standard OLS inference relies on; a small p-value is evidence of heteroscedasticity. In the output below, only the Condition model rejects constant variance at the 5% level (BP = 10.695, p = 0.001). The tests for the Quality, Garage Area, and Living Area models are not significant (p = 0.451, 0.236, and 0.209). These three results should be read with caution: m2_qual, m3_garage, and m4_living were reassigned in the residual diagnostics above to fits on predicted sale prices, so the tests no longer describe the models fitted to the training data. The larger statistics quoted for this analysis (278.39 for Living Area and 47.089 for Condition, with all four models significant) are not reproduced by this output. A larger Breusch-Pagan statistic means stronger evidence against constant variance. It does not measure how much of the variation in sale prices a predictor explains, so it should not be used to rank the usefulness of the models. Where heteroscedasticity is present, corrective measures such as robust regression techniques or variable transformations are needed before the models can provide reliable inference.
bptest(m1_cond)
##
## studentized Breusch-Pagan test
##
## data: m1_cond
## BP = 10.695, df = 1, p-value = 0.001074
bptest(m2_qual)
##
## studentized Breusch-Pagan test
##
## data: m2_qual
## BP = 0.56848, df = 1, p-value = 0.4509
bptest(m3_garage)
##
## studentized Breusch-Pagan test
##
## data: m3_garage
## BP = 1.4052, df = 1, p-value = 0.2359
bptest(m4_living)
##
## studentized Breusch-Pagan test
##
## data: m4_living
## BP = 1.581, df = 1, p-value = 0.2086
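To make the test statistic concrete, the studentized (Koenker) form of the Breusch-Pagan statistic can be computed by hand as n times the R-squared of an auxiliary regression of the squared residuals on the predictors. The sketch below uses base R and data simulated to be heteroscedastic; the variable names and values are assumptions, and the result should agree with `lmtest::bptest(fit)` under its default settings.

```r
set.seed(123)

# Simulated data whose error spread grows with x (heteroscedastic by design)
n <- 400
x <- runif(n, 600, 3000)
y <- 20000 + 100 * x + rnorm(n, sd = 20 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ x)

# Studentized BP statistic and its chi-squared p-value (df = number of predictors)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(BP = bp_stat, p.value = bp_p))
```

A small p-value here says only that the error variance changes with x; it says nothing about how well x predicts y.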
Normalisation of Target Variable “Sale Price” using log
The histogram provided displays the distribution of the logarithm of sale prices extracted from the Ames housing dataset. This transformation, often employed in regression analyses, aims to normalize positively skewed target variables. Upon analysis, the histogram reveals an approximately symmetrical distribution around the central values, indicating the effectiveness of the logarithmic transformation in normalizing the data. This normalization is advantageous as it aligns with assumptions of many statistical tests and models, particularly those assuming normally distributed errors. By stabilizing variance and reducing the influence of outliers, the transformed data enhances the performance and validity of statistical models, making predictions more reliable. Moreover, using logarithmic sale prices facilitates the capture of relative changes and elasticities in housing prices, enabling more interpretable insights, particularly in economic terms. Overall, the logarithmic transformation proves appropriate for addressing right-skewed sale price distributions in real estate data, ultimately leading to more robust models and better statistical inference and predictions.
ames_housing <- ames_housing %>%
mutate(sale_ames = log(SalePrice))
ggplot(ames_housing) +
geom_histogram(aes(sale_ames), color = "black", fill="orange")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
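The effect of the log transformation on skewness can also be checked numerically rather than by eye. The sketch below uses simulated log-normal prices, not the Ames data, and a simple moment-based `skewness()` function defined here for illustration.

```r
set.seed(2024)

# Simulated right-skewed prices (log-normal); values are invented
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness: about 0 for symmetric data, positive for right skew
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(price)
skew_log <- skewness(log(price))

print(c(raw = skew_raw, log = skew_log))
```

On the real data the same two calls would use `ames_housing$SalePrice` and `ames_housing$sale_ames`.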
Overall quality is strongly correlated with the log of sale price (0.817), and in the analysis above it also gave the best-fitting of the four models.
ames_housing %>%
summarise(cor(OverallQual, sale_ames))
## cor(OverallQual, sale_ames)
## 1 0.8171844
The regression of the log-transformed sale price (sale_ames) on overall quality (OverallQual) gives an intercept of about 10.6 and an OverallQual coefficient of 0.236, both highly statistically significant. The intercept is the predicted log price at a quality rating of zero, which lies outside the observed rating scale, so it serves only as a baseline. Because the response is on the log scale, the coefficient is a multiplicative effect: each one-unit rise in quality is associated with a sale price higher by a factor of exp(0.236) ≈ 1.27, or roughly 27%. The model fits well (R-squared: 0.668), explaining about 66.8% of the variability in log sale price, with a residual standard deviation of 0.230 and a high F-statistic (2931). The results indicate an exponential relationship between quality and sale price, making the model valuable for predictive purposes.
m5 <- lm(sale_ames ~ OverallQual, data = ames_housing)
tidy(m5)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.6 0.0273 388. 0
## 2 OverallQual 0.236 0.00436 54.1 0
glance(m5)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.668 0.668 0.230 2931. 0 1 73.1 -140. -124.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
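Since m5 is fitted on the log scale, its coefficients need to be back-transformed to be read in dollar terms. The sketch below uses the rounded estimates printed by `tidy(m5)` above; `predict_price()` is a hypothetical helper, and exponentiating a log-scale prediction is a simplification that ignores retransformation bias.

```r
# Estimates as printed by tidy(m5), rounded to the digits shown in the output
intercept <- 10.6
slope <- 0.236

# Percentage change in price associated with a one-unit rise in OverallQual
pct_change <- (exp(slope) - 1) * 100

# Hypothetical helper: back-transformed prediction for a given quality rating
predict_price <- function(quality) exp(intercept + slope * quality)

# The ratio between adjacent ratings is the same at every quality level
ratio <- predict_price(7) / predict_price(6)

print(round(pct_change, 1))
print(round(ratio, 3))
```

The constant ratio between adjacent ratings is what distinguishes this model from the linear fit, where each rating step adds a fixed dollar amount.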
The plot visualizes the residuals versus the fitted values for the regression model (m5) predicting the logarithmic transformation of sale prices based on overall house quality in the Ames housing dataset. The absence of a clear pattern in the residuals suggests reasonable linearity in the model. However, the presence of outliers, especially for higher fitted values, raises concerns about potential model sensitivity to extreme values. Additionally, the slight increase in residual spread with higher fitted values indicates possible heteroscedasticity, challenging the assumption of equal variance. While the model generally fits well, addressing these issues through further diagnostic checks or model adjustments could enhance its predictive accuracy and reliability.
m5_aug <- augment(m5)
ggplot(data = m5_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")+
theme_minimal()
The histogram shows the distribution of residuals from the regression model (m5). The bell-shaped curve indicates that the residuals are approximately normally distributed, in line with the assumptions of linear regression. The peak centered around zero suggests that the model is unbiased on average on the log scale. However, a few outliers and slight skewness towards the negative side hint at issues that could affect prediction reliability, especially for extreme values. While the overall distribution supports the model's validity, investigating the outliers and skewness could improve predictive performance and robustness.
ggplot(data = m5_aug, aes(x = .resid)) +
geom_histogram(color = 'red', fill = 'skyblue', binwidth = 0.05) +
xlab("Residuals") +
theme_minimal()
The diagnostic plots provide a comprehensive evaluation of the regression model’s assumptions, crucial for validating its suitability for predictive analysis. The posterior predictive check confirms that the model predictions align well with the observed data distribution. Linearity diagnostics show that residuals are evenly scattered around zero, supporting the assumption of a linear relationship between predictors and the response variable. Homoscedasticity diagnostics indicate consistent variance across fitted values, further validating model assumptions. While influential observations mostly fall within acceptable bounds, minor deviations suggest some points may warrant further scrutiny. The normality plot indicates minor deviations from normality, particularly in the upper tail. Overall, the model appears well-specified, with minor concerns that could be addressed with transformations or robust regression techniques for enhanced precision in predictions, especially at the extremes.
check_model(m5)
The statement “Error variance appears to be homoscedastic (p = 0.705)” indicates that the variability of residuals in the regression model is consistent across levels of the independent variable. In simpler terms, the spread of errors around the regression line does not systematically change as the predicted values increase or decrease. The p-value of 0.705, well above the conventional significance level of 0.05, means the test finds no evidence of heteroscedasticity. This does not prove that the variance is constant, but the assumption is not contradicted by the data. The finding matters because the usual least squares standard errors, and the confidence intervals and hypothesis tests built on them, rely on constant error variance. In conclusion, the model's consistency with the homoscedasticity assumption supports the credibility of its predictions and of the statistical conclusions drawn from it.
check_heteroscedasticity(m5)
## OK: Error variance appears to be homoscedastic (p = 0.705).